Crwy Documentation¶
Contents:
Installation¶
Requirements¶
- Python 2.7
- Works on Linux, Mac OSX
Quick Install¶
pip install crwy
Install from Source¶
1. Download the source from: https://codeload.github.com/wuyue92tree/crwy/zip/master
2. After the download completes, unpack the archive, enter the crwy package directory, and run:
python setup.py install
If any dependency fails to install, first run:
pip install -r requirements.txt
Once installation succeeds, your crwy journey can begin.
Command-Line Tool¶
Getting Started¶
Type crwy in a terminal and you will see the following output:
Crwy - no active project found!!!
Usage:
crwy <commands> [option] [args]
Available Commands:
  list          list all spiders in your project
  runspider     run a spider
  startproject  create a new project
  createspider  create a new spider
  version       show version
Use "crwy <command> -h" to see more info about a command
As you can see, crwy supports the list, runspider, startproject, and createspider commands. Want to know how each of them is used? Read on.
startproject¶
This command creates a new spider project:
crwy startproject spidertest
On success it prints:
Project start......enjoy^.^
So what exactly did the command do?
spidertest
├── crwy.cfg           declares the project name and where settings live
├── data               storage directory for crawl results (default path for sqlite storage)
│   ├── __init__.py
│   └── __init__.pyc
├── spidertest
│   └── settings.py    project configuration file
└── src                spider directory
    ├── __init__.py
    └── __init__.pyc
It created a project named spidertest, containing the configuration the spiders will use (explained in detail later).
createspider¶
This command creates a new spider. Add the "-h" flag to see how it is used:
crwy createspider -h
On success it prints:
Usage: crwy createspider [option] [args]
Options:
-h, --help show this help message and exit
-l, --list list available spider template name
-p PREVIEW, --preview=PREVIEW
preview spider template
-t TEMPLATE, --tmpl=TEMPLATE
spider template
-n NAME, --name=NAME new spider name
- -l: lists the available spider template names
- -p: previews a template's code
- -t: specifies the name of the template to inherit from
- -n: specifies the name of the spider to create
Example:
crwy createspider -t basic -n basictest
The generated spider will then appear in the src directory. Note: spiders must be created from the project root (the directory containing crwy.cfg); otherwise creation will fail.
runspider¶
This command runs a spider.
Add the "-h" flag to see how it is used:
crwy runspider -h
Usage: crwy runspider [option] [args]
Options:
-h, --help show this help message and exit
-n NAME, --name=NAME spider name
-p PROCESS, --process=PROCESS
                      crawl using multiple processes
- -n: specifies the name of the spider to run
- -p: runs the spider with multiple processes (-p is followed by the number of processes)
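For example, running the spider created earlier, and then running it across four processes (the process count is illustrative):
crwy runspider -n basictest
crwy runspider -n basictest -p 4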
Spider Templates¶
basic template¶
The basic template contains the most basic crawl logic:
- download: html_downloader
- parse: html_parser
Template content:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from crwy.spider import Spider
class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'

    def crawler_${spider_name}(self):
        try:
            url = 'http://example.com'
            response = self.html_downloader.download(url)
            soups = self.html_parser.parser(response.content)
            print(url)
            print(soups)
            self.logger.info('%s[%s] --> crawler success !!!' % (
                self.spider_name, self.worker))
        except Exception as e:
            self.logger.exception('%s[%s] --> %s' % (
                self.spider_name, self.worker, e))

    def run(self):
        self.crawler_${spider_name}()
As you can see, the template inherits from a class named Spider, which encapsulates the html_downloader downloader and the html_parser parser; for details, see the Html section under Utils.
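To make that concrete, here is a minimal sketch (not one of the shipped templates; it reuses only calls that appear in them) of a Spider subclass driving the downloader and parser directly:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# A minimal sketch, not part of the generated templates: it uses only
# calls shown above (html_downloader.download, html_parser.parser).
from __future__ import print_function
from crwy.spider import Spider


class DemoSpider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = 'demo'  # hypothetical spider name

    def run(self):
        url = 'http://example.com'
        response = self.html_downloader.download(url)
        soups = self.html_parser.parser(response.content)
        print(soups.find('title').text)


if __name__ == '__main__':
    DemoSpider().run()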
sqlite template¶
The sqlite template stores the crawled data in a sqlite database:
- download: html_downloader
- parse: html_parser
- store: sqlite
Template content:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from sqlalchemy import Integer, Column, String
from crwy.spider import Spider
from crwy.utils.sql.db import Database, Base
class Test(Base):
    __tablename__ = "test"
    id = Column(Integer, primary_key=True, unique=True)
    title = Column(String(20))
    url = Column(String(20))


class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'
        self.sql = Database('sqlite:///./data/test.db')
        self.sql.init_table()

    def crawler_${spider_name}(self):
        try:
            url = 'http://example.com'
            response = self.html_downloader.download(url)
            soups = self.html_parser.parser(response.content)
            # .text already returns unicode, so no extra decode is needed
            title = soups.find('title').text
            item = Test(title=title, url=url.decode('utf-8'))
            self.sql.session.merge(item)
            self.sql.session.commit()
            print(url)
            print(soups)
            self.logger.info('%s[%s] --> crawler success !!!' % (
                self.spider_name, self.worker))
        except Exception as e:
            self.logger.exception('%s[%s] --> %s' % (
                self.spider_name, self.worker, e))

    def run(self):
        self.crawler_${spider_name}()
Storage logic:
- A table is generated by creating a class that inherits from Base (which itself inherits from sqlalchemy's declarative_base).
- The Database class connects to the sqlite database, and init_table() creates the tables; see the Sql section under Utils for what the Database class provides.
- session.merge() stages the data, and session.commit() makes the changes take effect.
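The stored rows can be read back through the same session; a minimal sketch, assuming only what the template shows (that Database exposes a plain SQLAlchemy session):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from sqlalchemy import Integer, Column, String
from crwy.utils.sql.db import Database, Base


class Test(Base):
    __tablename__ = "test"
    id = Column(Integer, primary_key=True, unique=True)
    title = Column(String(20))
    url = Column(String(20))


sql = Database('sqlite:///./data/test.db')
# the session is a regular SQLAlchemy session, so the standard query API works
for row in sql.session.query(Test).all():
    print(row.id, row.title, row.url)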
queue template¶
The queue template loads the pages to be crawled into a queue, so progress can be tracked in real time:
- find the pages to crawl according to some rule and push their URLs onto the queue
- pop a URL from the queue
- download: html_downloader
- parse: html_parser
Template content:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import Queue
from crwy.spider import Spider
queue = Queue.Queue()
class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'

    def crawler_${spider_name}(self):
        while True:
            try:
                if not queue.empty():
                    url = 'http://example.com/%d' % queue.get()
                    response = self.html_downloader.download(url)
                    soups = self.html_parser.parser(response.content)
                    print(url)
                    print(soups)
                    print('Length of queue : %d' % queue.qsize())
                else:
                    self.logger.info('%s[%s] --> crawler success !!!' % (
                        self.spider_name, self.worker))
                    sys.exit()
            except Exception as e:
                self.logger.exception('%s[%s] --> %s' % (
                    self.spider_name, self.worker, e))
                continue

    def run(self):
        for i in range(1, 10):
            queue.put(i)
        self.crawler_${spider_name}()
The queue provides a natural entry point for multithreading.
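For example, a minimal sketch (plain standard library, not part of the template) of draining one shared queue from several threads:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# A minimal sketch: several threads consuming the same stdlib queue.
from __future__ import print_function
import Queue
import threading

queue = Queue.Queue()


def worker():
    while True:
        try:
            # get_nowait() raises Queue.Empty once the queue is drained
            item = queue.get_nowait()
        except Queue.Empty:
            break
        print('%s handled item %d' % (threading.current_thread().name, item))


for i in range(1, 10):
    queue.put(i)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()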
redis_queue template¶
The redis_queue template persists the queue on a redis server, so tasks are not lost when the machine goes down:
- connect to the redis server with RedisQueue and create a queue
- find the pages to crawl according to some rule and push their URLs onto the queue
- pop a URL from the queue
- download: html_downloader
- parse: html_parser
Template content:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
from crwy.spider import Spider
from crwy.utils.queue.RedisQueue import RedisQueue
from crwy.utils.filter.RedisSet import RedisSet
queue = RedisQueue('foo')
s_filter = RedisSet('foo')
class ${class_name}Spider(Spider):
    def __init__(self):
        Spider.__init__(self)
        self.spider_name = '${spider_name}'

    def crawler_${spider_name}(self):
        while True:
            try:
                if not queue.empty():
                    url = 'http://example.com/%s' % queue.get()
                    if s_filter.sadd(url) is False:
                        print('You got a crawled url. %s' % url)
                        continue
                    response = self.html_downloader.download(url)
                    soups = self.html_parser.parser(response.content)
                    print(url)
                    print(soups)
                    print('Length of queue : %s' % queue.qsize())
                else:
                    self.logger.info('%s[%s] --> crawler success !!!' % (
                        self.spider_name, self.worker))
                    sys.exit()
            except Exception as e:
                self.logger.exception('%s[%s] --> %s' % (
                    self.spider_name, self.worker, e))
                continue

    def add_queue(self):
        for i in range(100):
            queue.put(i)
        print(queue.qsize())

    def run(self):
        try:
            worker = sys.argv[4]
        except IndexError:
            print('No worker found!!!\n')
            sys.exit()
        if worker == 'crawler':
            self.crawler_${spider_name}()
        elif worker == 'add_queue':
            self.add_queue()
        elif worker == 'clean':
            queue.clean()
            s_filter.clean()
        else:
            print('Invalid worker <%s>!!!\n' % worker)
The add_queue() method makes it possible to keep adding new crawl targets without interrupting the running program.
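Since run() dispatches on sys.argv[4], the workers are selected from the command line roughly like this (the spider name redistest is illustrative):
crwy runspider -n redistest add_queue   # keep pushing new targets
crwy runspider -n redistest crawler     # consume the queue
crwy runspider -n redistest clean       # drop the queue and the dedup set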
Utils in Detail¶
Html¶
html_downloader¶
Uses requests as the download engine.
This framework uses requests version 2.12.0.
- download(url, method='GET', timeout=60, **kwargs)
  - url: target site URL
  - method: request method (default GET)
  - timeout: request timeout in seconds (default 60)
  - **kwargs: identical to the requests arguments
- downloadFile(url, save_path='./data/')
  - url: target file URL
  - save_path: path under which the file is saved
requests docs: http://www.python-requests.org/en/master/
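A minimal usage sketch (URLs and header values are illustrative; it assumes Spider can be instantiated directly, as the templates' Spider.__init__(self) call suggests):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from crwy.spider import Spider

spider = Spider()
# extra kwargs (headers, cookies, proxies, ...) behave as in requests
response = spider.html_downloader.download(
    'http://example.com', timeout=30,
    headers={'User-Agent': 'Mozilla/5.0'})
print(response.status_code)
# downloadFile() stores the target under save_path
spider.html_downloader.downloadFile('http://example.com/logo.png',
                                    save_path='./data/')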
html_parser¶
Uses BeautifulSoup4 as the parsing engine.
- parser(response): parses UTF-8 encoded pages
- gbk_parser(response): parses GBK encoded pages
- jsonp_parser(response): parses irregular JSON pages (keys without double quotes) and returns a dict
BeautifulSoup4 docs: https://www.crummy.com/software/BeautifulSoup/
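A minimal sketch of choosing a parser by page encoding (the URL is illustrative; the GBK and JSONP variants are commented out because example.com serves UTF-8):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from crwy.spider import Spider

spider = Spider()
response = spider.html_downloader.download('http://example.com')

soups = spider.html_parser.parser(response.content)  # UTF-8 pages
print(soups.find('title').text)

# for GBK-encoded pages:
# soups = spider.html_parser.gbk_parser(response.content)

# for irregular JSON (keys without double quotes), returns a dict:
# data = spider.html_parser.jsonp_parser(response.content)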
Sql¶
db¶
Uses sqlalchemy to handle database access. For the databases supported, see: http://docs.sqlalchemy.org/en/latest/core/engines.html
- __init__(db_url, **kwargs)
  - db_url: the database URL
- init_table(): initializes the database (creates all declared tables)
- drop_table(): empties the database (drops its tables)
sqlalchemy docs: http://www.sqlalchemy.org/
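A minimal sketch of the interface; db_url follows SQLAlchemy's engine URL format, so the MySQL line is only an illustration of that format, not a tested configuration:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from crwy.utils.sql.db import Database

# sqlite, as used by the sqlite template
sql = Database('sqlite:///./data/test.db')
sql.init_table()    # create all tables declared on Base
# sql.drop_table()  # drop them again

# any other SQLAlchemy-supported backend uses the same URL scheme, e.g.:
# sql = Database('mysql://user:password@localhost:3306/dbname')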
Extras¶
Redis Queue¶
How to use redis elegantly as a message queue.
- __init__(name, namespace='queue', **redis_kwargs)
  - name: queue name
  - namespace: namespace (default queue)
  - **redis_kwargs: initialization parameters for the redis module
- qsize(): returns the queue length
- empty(): returns True when the queue is empty
- put(): pushes one item onto the queue
- get(): pops one item from the queue
- get_nowait(): pops one item from the queue without blocking
- clean(): empties the queue
The code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: wuyue92tree@163.com
import redis
class RedisQueue(object):
    """Simple Queue with Redis Backend"""

    def __init__(self, name, namespace='queue', **redis_kwargs):
        """The default connection parameters are: host='localhost', port=6379, db=0"""
        self.__db = redis.Redis(**redis_kwargs)
        self.key = '%s:%s' % (namespace, name)

    def qsize(self):
        """Return the approximate size of the queue."""
        return self.__db.llen(self.key)

    def empty(self):
        """Return True if the queue is empty, False otherwise."""
        return self.qsize() == 0

    def put(self, item):
        """Put item into the queue."""
        self.__db.rpush(self.key, item)

    def get(self, block=True, timeout=None):
        """Remove and return an item from the queue.

        If optional arg block is true and timeout is None (the default),
        block if necessary until an item is available."""
        if block:
            # blpop returns a (key, value) tuple; keep only the value
            item = self.__db.blpop(self.key, timeout=timeout)
            if item:
                item = item[1]
        else:
            # lpop returns the value itself (or None when empty)
            item = self.__db.lpop(self.key)
        return item

    def get_nowait(self):
        """Equivalent to get(False)."""
        return self.get(False)

    def clean(self):
        """Empty key"""
        return self.__db.delete(self.key)
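A short usage sketch of the class above (connection parameters are whatever redis.Redis accepts; the values shown are its defaults):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import print_function
from crwy.utils.queue.RedisQueue import RedisQueue

queue = RedisQueue('foo', namespace='queue',
                   host='localhost', port=6379, db=0)
queue.put('http://example.com/1')
print(queue.qsize())   # 1
print(queue.get())     # 'http://example.com/1' (blocks until available)
print(queue.empty())   # True
queue.clean()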
SSDB Queue¶
How to use SSDB elegantly as a message queue.
- __init__(name, **ssdb_kwargs)
  - name: queue name
  - **ssdb_kwargs: initialization parameters for the pyssdb module
- qsize(): returns the queue length
- empty(): returns True when the queue is empty
- put(): pushes one item onto the queue
- get(): pops one item from the queue
- clean(): empties the queue
The code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: wuyue92tree@163.com
import pyssdb
class SsdbQueue(object):
    """Simple Queue with SSDB Backend"""

    def __init__(self, name, **ssdb_kwargs):
        """The default connection parameters are: host='localhost', port=8888"""
        self.__db = pyssdb.Client(**ssdb_kwargs)
        self.key = name

    def qsize(self):
        """Return the approximate size of the queue."""
        return self.__db.qsize(self.key)

    def empty(self):
        """Return True if the queue is empty, False otherwise."""
        return self.qsize() == 0

    def put(self, item):
        """Put item into the queue."""
        self.__db.qpush(self.key, item)

    def get(self):
        """Remove and return an item from the queue."""
        return self.__db.qpop(self.key)

    def clean(self):
        """Empty key"""
        return self.__db.qclear(self.key)
Logging¶
Logging is driven by configuration files,
e.g. default_logger.conf:
#logger.conf
###############################################
[loggers]
keys=root,fileLogger,rtLogger
[logger_root]
level=INFO
handlers=consoleHandler
[logger_fileLogger]
handlers=consoleHandler,fileHandler
qualname=fileLogger
propagate=0
[logger_rtLogger]
handlers=consoleHandler,rtHandler
qualname=rtLogger
propagate=0
###############################################
[handlers]
keys=consoleHandler,fileHandler,rtHandler
[handler_consoleHandler]
class=StreamHandler
level=INFO
formatter=simpleFmt
args=(sys.stderr,)
[handler_fileHandler]
class=FileHandler
level=DEBUG
formatter=defaultFmt
args=('./log/default.log', 'a')
[handler_rtHandler]
class=handlers.RotatingFileHandler
level=DEBUG
formatter=defaultFmt
args=('./log/default.log', 'a', 10*1024*1024, 5)
###############################################
[formatters]
keys=defaultFmt,simpleFmt
[formatter_defaultFmt]
format=%(asctime)s %(filename)s %(funcName)s [line:%(lineno)d] %(levelname)s %(message)s
datefmt=%Y-%m-%d %H:%M:%S
[formatter_simpleFmt]
format=%(name)-12s: %(levelname)-8s %(message)s
datefmt=
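A minimal sketch (standard library only) of loading such a file; note that the file handlers above write to ./log/default.log, so the ./log directory must exist first:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging.config

# load the configuration shown above; ./log/ must already exist for the
# file-based handlers to open ./log/default.log
logging.config.fileConfig('default_logger.conf')

logger = logging.getLogger('fileLogger')  # or 'rtLogger', or the root logger
logger.info('spider started')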
FAQ¶
Coming soon......