libweb - A library for parsing the web¶
libweb is, simply, a parsing engine for the web. The goal of the libweb project is to provide a library capable of parsing the vast majority of consumable content on the web. libweb strives to maintain compatibility with current versions of Python, and specifically tests against Python 2.7 and Python 3.3+.
Documentation¶
User Guide¶
Introduction¶
libweb is a framework for interacting with web sites of all shapes and sizes. Parsers are included for the most common web formats, and new parsers are easy to add. libweb officially supports Python 2.7, Python 3.3, Python 3.4 and Python 3.5, making it easy to integrate into whatever your next web project may be.
Installation¶
Install from PyPI¶
python-libweb can be installed using pip3:
pip3 install libweb
Or, if you’re feeling adventurous, install it directly from GitHub:
pip3 install git+https://github.com/HurricaneLabs/python-libweb.git
Source Code¶
libweb lives on GitHub, making the code easy to browse, download, fork, etc. Pull requests are always welcome! Also, please remember to star the project if it makes you happy.
Once you have cloned the repo or downloaded a tarball from GitHub, you can install libweb like this:
$ cd python-libweb
$ pip3 install .
Or, if you want to edit the code, first fork the main repo and clone the fork to your machine. Then install it in editable mode, which links the installed package to your checkout so that your changes are available to your app immediately, without reinstalling the package:
$ cd python-libweb
$ pip3 install -e .
Did we mention we love pull requests? :)
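After an editable install, it can be useful to confirm that Python is importing libweb from your checkout rather than from a copy installed elsewhere. The following is a minimal sketch using only the standard library; the helper name package_origin is made up for this example and is not part of libweb:

```python
import importlib.util


def package_origin(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # The package is not importable in this environment.
        return None
    return spec.origin


# With an editable install, this path should point inside your clone.
print(package_origin("libweb"))
```

If this prints None, libweb is not installed in the interpreter you are running.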
Quickstart¶
If you haven’t done so already, please take a moment to install the libweb library before continuing.
Learning by Example¶
Here is a simple parser from libweb’s README, showing how to get started interacting with the web:
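# spamhaus.py
from libweb.dns import DnsblService

# {target} in the rrname template is filled in from opts
conf = {
    "rrname": "{target}.zen.spamhaus.org",
    "rrtype": "A",
}

# Iterating over the service yields one result per DNS answer
for result in DnsblService(opts={"target": "127.0.0.2"}, **conf):
    print(result)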
Then, to run the sample parser:
$ python3 spamhaus.py
OrderedDict([('name', '2.0.0.127.zen.spamhaus.org.'), ('type', 'A'), ('class', 'IN'), ('ttl', 60), ('rdata', '127.0.0.2')])
OrderedDict([('name', '2.0.0.127.zen.spamhaus.org.'), ('type', 'A'), ('class', 'IN'), ('ttl', 60), ('rdata', '127.0.0.10')])
OrderedDict([('name', '2.0.0.127.zen.spamhaus.org.'), ('type', 'A'), ('class', 'IN'), ('ttl', 60), ('rdata', '127.0.0.4')])
$
More Features¶
Here is a more involved example demonstrating the features available in all of the HTTP-based parsers:
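# virustotal.py
import sys

from libweb.json import JsonService

conf = {
    "url": "https://www.virustotal.com/vtapi/v2/ip-address/report",
    # Query parameters; {target} is filled in from opts
    "params": {
        "ip": "{target}"
    },
    # Names the entry in creds to use and the parameter it supplies
    "auth": {
        "name": "virustotal",
        "params": ["apikey"]
    },
    # Result field name -> JSONPath expression applied to the response
    "jsonpath": {
        "url": "$.detected_urls[*].url",
        "pdns": "$.resolutions[*]",
        "asn": "$.asn",
        "country": "$.country",
        "as_owner": "$.as_owner",
    }
}

# First command-line argument: the API key
creds = {
    "virustotal": [sys.argv[1]],
}

# Second command-line argument: the IP address to look up
opts = {
    "target": sys.argv[2]
}

for result in JsonService(opts=opts, creds=creds, **conf):
    print(result)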
You will need a VirusTotal API key to run this sample. Feel free to borrow the key from our sister project, Machinae. You can run the sample like so:
$ python3 virustotal.py <apikey> 209.95.50.13
OrderedDict([('asn', '29854'), ('country', 'US'), ('as_owner', 'WestHost, Inc.'), ('pdns', {'hostname': 'us-newyorkcity.privateinternetaccess.com', 'last_resolved': '2016-03-13 00:00:00'})])
$
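To make the jsonpath section of the configuration more concrete, here is a small standalone sketch of what those expressions select. It is an illustration only, not libweb's implementation: the select function is a hypothetical helper that handles just the two path shapes used above ("$.key" and "$.key[*]", optionally followed by another key), and the response dictionary is a hand-written sample shaped like the output shown above.

```python
def select(document, path):
    """Return the values matched by a very small subset of JSONPath."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    matches = [document]
    for step in path[1:].split("."):
        if not step:
            continue  # skip the empty piece before the first "."
        wildcard = step.endswith("[*]")
        key = step[:-3] if wildcard else step
        next_matches = []
        for node in matches:
            if not isinstance(node, dict) or key not in node:
                continue
            value = node[key]
            if wildcard:
                # "[*]" expands a list into its individual items
                if isinstance(value, list):
                    next_matches.extend(value)
            else:
                next_matches.append(value)
        matches = next_matches
    return matches


# Hand-written sample data, modeled on the output above (not a real response)
response = {
    "asn": "29854",
    "country": "US",
    "as_owner": "WestHost, Inc.",
    "detected_urls": [],
    "resolutions": [
        {
            "hostname": "us-newyorkcity.privateinternetaccess.com",
            "last_resolved": "2016-03-13 00:00:00",
        }
    ],
}

jsonpath = {
    "url": "$.detected_urls[*].url",
    "pdns": "$.resolutions[*]",
    "asn": "$.asn",
    "country": "$.country",
    "as_owner": "$.as_owner",
}

fields = {name: select(response, expr) for name, expr in jsonpath.items()}
for name, values in fields.items():
    print(name, values)
```

In this sample the detected_urls list is empty, so the "url" expression matches nothing, which is consistent with the absence of a url field in the output above.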
libweb Parsers¶
libweb.dns¶
DnsService¶
DnsblService¶
- class libweb.dns.DnsblService(creds=None, opts=None, **conf)
A DNS-based service where the service options are reversed for use in a DNSBL.
Keyword Arguments:
- get_rrname(rrname)
Formats the rrname using the options passed to the service. All options are split using "." as the separator and then the order reversed, as is required for DNSBL services (such as Spamhaus).
Parameters: rrname (str) – A string template for rendering the rrname to be requested
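The reversal that get_rrname describes can be sketched in a few lines. This is a minimal illustration of the technique under the description above, not the library's actual code; the function name dnsbl_rrname is hypothetical.

```python
def dnsbl_rrname(template, **opts):
    """Render an rrname template after reversing each option's dotted parts."""
    reversed_opts = {
        name: ".".join(reversed(str(value).split(".")))
        for name, value in opts.items()
    }
    return template.format(**reversed_opts)


print(dnsbl_rrname("{target}.zen.spamhaus.org", target="127.0.0.2"))
# 2.0.0.127.zen.spamhaus.org
```

The result matches the name queried in the Quickstart output, apart from the trailing dot that DNS appends to fully qualified names.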
libweb.http¶
HttpService¶
- class libweb.http.HttpService(creds=None, opts=None, **conf)
A simple service based on HTTP requests. This class should not be used directly.
- build_request(url, method='GET', **kwargs)
Apply request hooks to automatically transform request content.
Override this if you need to customize the Request object generated.
- get_auth(auth)
Find and apply authentication.
Override this if you need to support additional styles of authentication.
- process_params(orig_params)
Process parameters into usable pieces.
Override this if you provide any config parameters that may require interpretation, such as the relatime parameter.
- session
Return a requests Session object which sets a User-Agent header.
libweb.json¶
libweb.regex¶
RegexService¶
Contribute to libweb¶
Thanks for your interest in the project! We welcome pull requests from developers of all skill levels. To get started, simply fork the master branch on GitHub to your personal account and then clone the fork into your development environment.
Steve McMaster (iamthemcmaster on Twitter) is the original creator of the libweb project, and currently maintains the project for Hurricane Labs.
Thanks!
Code style rules¶
Code style for the libweb project follows three simple rules:
- Our code should be readable and easy to follow.
- Our code should be well commented.
- Our code should be well tested.
Tox tests include coverage testing and certain code quality tests, and any PRs are expected to maintain the same level of coverage and quality. No preference is given for line length, single vs. double quotes, etc., as long as the code remains readable and understandable.
libweb License¶
The MIT License (MIT)
Copyright (c) 2016 Hurricane Labs, LLC
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.