Crwy Documentation

Installation

Runtime Environment

  • Python 2.7
  • Works on Linux and Mac OS X

Quick Install

pip install crwy

Installing from Source

1. Download the source from: https://codeload.github.com/wuyue92tree/crwy/zip/master

2. After the download completes, unzip the archive, enter the crwy package directory, and run the following command:

python setup.py install

If any dependencies fail to install, first run:

pip install -r requirements.txt

Once the installation succeeds, your crwy journey can begin.

Command-Line Tool

Getting Started

Type crwy in a terminal and you will see the following output:

Crwy - no active project found!!!

Usage:
  crwy <commands> [option] [args]

Avaliable Commands:
  list              list all spider in your project
  runspider         run a spider
  startproject      create a new project
  createspider      create a new spider
  version           show version

Use "crwy <command> -h" to see more info about a command

As you can see, crwy supports the list, runspider, startproject, and createspider commands. Want to know how each of them is used? Read on.

startproject

This command creates a new spider project:

crwy startproject spidertest

On success it prints:

Project start......enjoy^.^

So what exactly does this command do?

spidertest
├── crwy.cfg            identifies the project name and the directory containing settings
├── data                storage directory for crawl results (default path for sqlite storage)
│   ├── __init__.py
│   └── __init__.pyc
├── spidertest
│   └── settings.py     project configuration file
└── src                 directory where the spiders live
    ├── __init__.py
    └── __init__.pyc

It creates a project named spidertest, which contains the configuration the spiders will use (explained in detail later).

createspider

This command creates a new spider. Add the "-h" flag to see how it is used:

crwy createspider -h

执行成功会得到如下返回:

Usage:  crwy createspider [option] [args]

Options:
  -h, --help            show this help message and exit
  -l, --list            list available spider template name
  -p PREVIEW, --preview=PREVIEW
                        preview spider template
  -t TEMPLATE, --tmpl=TEMPLATE
                        spider template
  -n NAME, --name=NAME  new spider name

  • -l: list the available spider templates
  • -p: preview a template's code
  • -t: specify the template to inherit from
  • -n: specify the name of the spider to create

Example:

crwy createspider -t basic -n basictest

The generated spider can then be found in the src directory. Note: spiders must be created from the project root directory (i.e. the directory containing crwy.cfg), otherwise creation will fail.
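
For example, you can list the available templates first and then preview one. The second command below is a hedged guess that -p takes a template name reported by -l:

crwy createspider -l
crwy createspider -p basic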

list

This command lists the spiders in your project:

crwy list

runspider

This command runs a spider.

Add the "-h" flag to see how it is used:

crwy runspider -h
Usage:  crwy runspider [option] [args]

Options:
  -h, --help            show this help message and exit
  -n NAME, --name=NAME  spider name
  -p PROCESS, --process=PROCESS
                        crawler by multi process

  • -n: specify the name of the spider to run
  • -p: run the spider with multiple processes (the number of processes follows -p)
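
For example, to run the basictest spider created earlier, either in a single process or with two processes:

crwy runspider -n basictest
crwy runspider -n basictest -p 2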

version

This command shows the crwy version number:

crwy version

Spider Templates

basic template

The basic template contains the most basic crawling logic:

  • Download: html_downloader
  • Parse: html_parser

The template looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

from crwy.spider import Spider


class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'

    def crawler_${spider_name}(self):
        try:
            url = 'http://example.com'
            response = self.html_downloader.download(url)
            soups = self.html_parser.parser(response.content)
            print(url)
            print(soups)
            self.logger.info('%s[%s] --> %s crawler success !!!' % (
                self.spider_name, self.worker, self.func_name()))

        except Exception as e:
            self.logger.exception('%s[%s] --> %s' % (
                self.spider_name, self.worker, e))

    def run(self):
        self.crawler_${spider_name}()

As you can see, the template inherits from a class named Spider, which wraps the html_downloader downloader and the html_parser parser. For details, see the Html section of the Utils reference.

sqlite template

The sqlite template stores the crawled data in a sqlite database:

  • Download: html_downloader
  • Parse: html_parser
  • Store: sqlite

The template looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

from sqlalchemy import Integer, Column, String
from crwy.spider import Spider
from crwy.utils.sql.db import Database, Base


class Test(Base):
    __tablename__ = "test"
    id = Column(Integer, primary_key=True, unique=True)
    title = Column(String(20))
    url = Column(String(20))


class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'
        self.sql = Database('sqlite:///./data/test.db')
        self.sql.init_table()

    def crawler_${spider_name}(self):
        try:
            url = 'http://example.com'
            response = self.html_downloader.download(url)
            soups = self.html_parser.parser(response.content)
            title = soups.find('title').text
            item = Test(title=title.decode('utf-8'), url=url.decode('utf-8'))
            self.sql.session.merge(item)
            self.sql.session.commit()
            print(url)
            print(soups)
            self.logger.info('%s[%s] --> crawler success !!!' % (
                self.spider_name, self.worker))

        except Exception as e:
            self.logger.exception('%s[%s] --> %s' % (
                self.spider_name, self.worker, e))

    def run(self):
        self.crawler_${spider_name}()

Storage logic:

  1. Define the table by creating a class that inherits from Base (which wraps sqlalchemy's declarative_base)
  2. Connect to the sqlite database through the Database class and call init_table() to create the table (see the Sql section below for details on the Database class)
  3. Call session.merge() to store the data, then session.commit() to make the changes take effect
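
The stored rows can be read back with an ordinary sqlalchemy query. A minimal sketch, assuming Database.session is a regular sqlalchemy session as the template above suggests (the Test model is redefined here so the sketch is self-contained):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

from sqlalchemy import Integer, Column, String
from crwy.utils.sql.db import Database, Base


class Test(Base):
    __tablename__ = "test"
    id = Column(Integer, primary_key=True, unique=True)
    title = Column(String(20))
    url = Column(String(20))


# open the same sqlite file the template writes to
sql = Database('sqlite:///./data/test.db')
for row in sql.session.query(Test).all():
    print(row.id, row.title, row.url)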

queue template

The queue template loads the pages to be crawled into a queue, so that the progress of the queue can be tracked in real time:

  • Find the pages to crawl according to some rule and push their URLs onto the queue
  • Take a URL off the queue
  • Download: html_downloader
  • Parse: html_parser

The template looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

import sys
import Queue
from crwy.spider import Spider

queue = Queue.Queue()


class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'

    def crawler_${spider_name}(self):
        while True:
            try:
                if not queue.empty():
                    url = 'http://example.com/%d' % queue.get()
                    response = self.html_downloader.download(url)
                    soups = self.html_parser.parser(response.content)
                    print(url)
                    print(soups)
                    print('Length of queue : %d' % queue.qsize())
                else:
                    self.logger.info('%s[%s] --> crawler success !!!' % (
                        self.spider_name, self.worker))
                    sys.exit()

            except Exception as e:
                self.logger.exception('%s[%s] --> %s' % (
                    self.spider_name, self.worker, e))
                continue

    def run(self):
        for i in range(1, 10):
            queue.put(i)

        self.crawler_${spider_name}()

The queue provides a convenient entry point for multithreading.
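
For example, the same in-memory queue can feed several worker threads. A minimal sketch (not part of the template) using only the standard library:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

import threading
import Queue

queue = Queue.Queue()
for i in range(1, 10):
    queue.put(i)


def worker():
    while True:
        try:
            page_id = queue.get_nowait()
        except Queue.Empty:
            break
        # a real spider would download and parse the page here
        print('crawling http://example.com/%d' % page_id)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()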

redis_queue template

The redis_queue template persists the queue to a redis server, so that tasks are not lost when the machine goes down:

  • Connect to the redis server with RedisQueue and create a new queue
  • Find the pages to crawl according to some rule and push their URLs onto the queue
  • Take a URL off the queue
  • Download: html_downloader
  • Parse: html_parser

The template looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from __future__ import print_function

import sys
from crwy.spider import Spider
from crwy.utils.queue.RedisQueue import RedisQueue
from crwy.utils.filter.RedisSet import RedisSet


queue = RedisQueue('foo')
s_filter = RedisSet('foo')


class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'

    def crawler_${spider_name}(self):
        while True:
            try:
                if not queue.empty():
                    url = 'http://example.com/%s' % queue.get()
                    if s_filter.sadd(url) is False:
                        print('You got a crawled url. %s' % url)
                        continue
                    response = self.html_downloader.download(url)
                    soups = self.html_parser.parser(response.content)
                    print(url)
                    print(soups)
                    print('Length of queue : %s' % queue.qsize())
                else:
                    self.logger.info('%s[%s] --> crawler success !!!' % (
                        self.spider_name, self.worker))
                    sys.exit()

            except Exception as e:
                self.logger.exception('%s[%s] --> %s' % (
                    self.spider_name, self.worker, e))
                continue

    def add_queue(self):
        for i in range(100):
            queue.put(i)
        print(queue.qsize())

    def run(self):
        try:
            worker = sys.argv[4]
        except IndexError:
            print('No worker found!!!\n')
            sys.exit()

        if worker == 'crawler':
            self.crawler_${spider_name}()
        elif worker == 'add_queue':
            self.add_queue()
        elif worker == 'clean':
            queue.clean()
            s_filter.clean()
        else:
            print('Invalid worker <%s>!!!\n' % worker)

The add_queue() method makes it possible to keep adding new crawl targets without interrupting the running program.
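
Since run() reads the worker name from sys.argv[4], the different workers would typically be selected from the command line roughly as shown below. This is a hedged example: it assumes a spider named redistest started with -n, so that the extra positional argument lands in sys.argv[4]:

crwy runspider -n redistest add_queue
crwy runspider -n redistest crawler
crwy runspider -n redistest clean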

Utils Reference

Html

html_downloader

Uses requests as the downloader engine.

This framework uses version 2.12.0.

  • download(url, method='GET', timeout=60)
url: target URL
method: HTTP request method, GET by default
timeout: timeout in seconds (60 by default)
**kwargs: passed through to requests
  • downloadFile(url, save_path='./data/')
url: URL of the file to download
save_path: path where the file is saved
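
A minimal usage sketch, assuming the Spider base class exposes html_downloader exactly as described above (the spider name and URLs are placeholders):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

from crwy.spider import Spider


class DownloadDemoSpider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = 'download_demo'

    def run(self):
        # plain GET with a shorter timeout
        response = self.html_downloader.download('http://example.com',
                                                 timeout=30)
        print(response.status_code)
        # save a file under ./data/ (placeholder file URL)
        self.html_downloader.downloadFile('http://example.com/robots.txt',
                                          save_path='./data/')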

requests documentation: http://www.python-requests.org/en/master/

html_parser

Uses BeautifulSoup4 as the parser engine.

  • parser(response)
parse a UTF-8 encoded page
  • gbk_parser(response)
parse a GBK encoded page
  • jsonp_parser(response)
parse an irregular JSON page (keys without double quotes) and return a dict
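
A minimal sketch of how the parsers are called from inside a spider's crawler method, assuming (as the templates above suggest) that parser() and gbk_parser() return BeautifulSoup objects:

# inside the crawler method of a Spider subclass (see the templates above)
response = self.html_downloader.download('http://example.com')
soups = self.html_parser.parser(response.content)   # UTF-8 page
print(soups.find('title').text)

# a GBK-encoded page would go through gbk_parser instead:
# soups = self.html_parser.gbk_parser(response.content)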

BeautifulSoup4 documentation: https://www.crummy.com/software/BeautifulSoup/

Sql

db

Uses sqlalchemy to work with databases. For the databases that are supported, see: http://docs.sqlalchemy.org/en/latest/core/engines.html

  • __init__(db_url, **kwargs)
db_url: the database URL
  • init_table()
initialize the database (create the tables)
  • drop_table()
empty the database (drop the tables)
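
A minimal sketch, assuming Database wraps a standard sqlalchemy engine and session as the sqlite template above suggests (the commented MySQL URL only illustrates the db_url format; the credentials are placeholders):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function

from sqlalchemy import Integer, Column, String
from crwy.utils.sql.db import Database, Base


class Page(Base):
    __tablename__ = 'page'
    id = Column(Integer, primary_key=True)
    url = Column(String(255))


# any sqlalchemy db_url should work, e.g.:
# sql = Database('mysql+pymysql://user:password@localhost/crwy')
sql = Database('sqlite:///./data/demo.db')
sql.init_table()                     # create the tables defined on Base
sql.session.merge(Page(id=1, url='http://example.com'))
sql.session.commit()
print(sql.session.query(Page).count())
# sql.drop_table()                   # drop the tables again if needed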

sqlalchemy documentation: http://www.sqlalchemy.org/

Additional Content

Redis Queue

How to elegantly use redis as a message queue.

  • __init__(name, namespace='queue', **redis_kwargs)
name: queue name
namespace: namespace (queue by default)
**redis_kwargs: initialization arguments for the redis module
  • qsize()
    return the length of the queue
  • empty()
    return True if the queue is empty
  • put()
    push an item onto the queue
  • get()
    pop an item off the queue
  • get_nowait()
    pop an item off the queue without blocking
  • clean()
    empty the queue

The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: wuyue92tree@163.com

import redis


class RedisQueue(object):
    """Simple Queue with Redis Backend"""

    def __init__(self, name, namespace='queue', **redis_kwargs):
        """The default connection parameters are: host='localhost', port=6379, db=0"""
        self.__db = redis.Redis(**redis_kwargs)
        self.key = '%s:%s' % (namespace, name)

    def qsize(self):
        """Return the approximate size of the queue."""
        return self.__db.llen(self.key)

    def empty(self):
        """Return True if the queue is empty, False otherwise."""
        return self.qsize() == 0

    def put(self, item):
        """Put item into the queue."""
        self.__db.rpush(self.key, item)

    def get(self, block=True, timeout=None):
        """Remove and return an item from the queue.

        If optional args block is true and timeout is None (the default), block
        if necessary until an item is available."""
        if block:
            item = self.__db.blpop(self.key, timeout=timeout)
        else:
            item = self.__db.lpop(self.key)

        if item:
            item = item[1]
        return item

    def get_nowait(self):
        """Equivalent to get(False)."""
        return self.get(False)

    def clean(self):
        """Empty key"""
        return self.__db.delete(self.key)
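
A quick usage sketch (it assumes a redis server is reachable on localhost:6379):

from crwy.utils.queue.RedisQueue import RedisQueue

q = RedisQueue('demo', host='localhost', port=6379, db=0)
q.put('http://example.com/1')
q.put('http://example.com/2')
print(q.qsize())   # 2
print(q.get())     # http://example.com/1
q.clean()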

SSDB Queue

How to elegantly use SSDB as a message queue.

  • __init__(name, **ssdb_kwargs)
name: queue name
**ssdb_kwargs: initialization arguments for the pyssdb module
  • qsize()
    return the length of the queue
  • empty()
    return True if the queue is empty
  • put()
    push an item onto the queue
  • get()
    pop an item off the queue
  • clean()
    empty the queue

The code is as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: wuyue92tree@163.com

import pyssdb


class SsdbQueue(object):
    """Simple Queue with SSDB Backend"""

    def __init__(self, name, **ssdb_kwargs):
        """The default connection parameters are: host='localhost', port=8888"""
        self.__db = pyssdb.Client(**ssdb_kwargs)
        self.key = name

    def qsize(self):
        """Return the approximate size of the queue."""
        return self.__db.qsize(self.key)

    def empty(self):
        """Return True if the queue is empty, False otherwise."""
        return self.qsize() == 0

    def put(self, item):
        """Put item into the queue."""
        self.__db.qpush(self.key, item)

    def get(self):
        """Remove and return an item from the queue.

        If optional args block is true and timeout is None (the default), block
        if necessary until an item is available."""

        item = self.__db.qpop(self.key)

        return item

    def clean(self):
        """Empty key"""
        return self.__db.qclear(self.key)

Logging

Logging is driven by a configuration file.

e.g. default_logger.conf

#logger.conf
###############################################
[loggers]
keys=root,fileLogger,rtLogger

[logger_root]
level=INFO
handlers=consoleHandler

[logger_fileLogger]
handlers=consoleHandler,fileHandler
qualname=fileLogger
propagate=0

[logger_rtLogger]
handlers=consoleHandler,rtHandler
qualname=rtLogger
propagate=0

###############################################
[handlers]
keys=consoleHandler,fileHandler,rtHandler

[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=simpleFmt
args=(sys.stderr,)

[handler_fileHandler]
class=FileHandler
level=DEBUG
formatter=defaultFmt
args=('./log/default.log', 'a')

[handler_rtHandler]
class=handlers.RotatingFileHandler
level=DEBUG
formatter=defaultFmt
args=('./log/default.log', 'a', 10*1024*1024, 5)

###############################################

[formatters]
keys=defaultFmt,simpleFmt

[formatter_defaultFmt]
format=%(asctime)s %(filename)s %(funcName)s [line:%(lineno)d] %(levelname)s %(message)s
datefmt=%Y-%m-%d %H:%M:%S

[formatter_simpleFmt]
format=%(name)-12s: %(levelname)-8s %(message)s
datefmt=
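
A minimal sketch of loading such a configuration with the standard logging module (it assumes the file is saved as ./default_logger.conf and that the ./log/ directory already exists, since the file handlers write to ./log/default.log):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import logging.config

logging.config.fileConfig('./default_logger.conf')

logging.getLogger().info('root logger: console only')
logging.getLogger('fileLogger').info('console + ./log/default.log')
logging.getLogger('rtLogger').info('console + rotating ./log/default.log')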

FAQ

Coming soon......