Welcome to eWRT - Extensible Web Retrieval Toolkit’s documentation!¶
Knowledge capture in the age of massive Web data requires robust and scalable mechanisms to acquire, consolidate and pre-process large amounts of heterogeneous data. The Extensible Web Retrieval Toolkit (eWRT) is a modular open-source Python API that addresses this requirement. It retrieves social data from Web sources such as Delicious, Flickr, Yahoo! and Wikipedia, and includes helper classes for effective caching and data management. The toolkit provides components for content acquisition and caching, as well as low-level natural language processing functionality such as language detection, phonetic string similarity measures, and string normalization.
eWRT has been jointly developed by researchers from MODUL University Vienna, webLyzard technology, the University of Applied Sciences Chur, and the Vienna University of Economics and Business. The library is currently being extended as part of the uComp Project, which investigates Embedded Human Computation for Knowledge Extraction and Evaluation.
Checkout: eWRT on gitweb.
Contents:
Installation¶
todo: Finish this part
Access to the Repository
Clone the eWRT git repository:
git clone http://git.semanticlab.net/eWRT.git
Or directly install using pip:
pip install git+http://git.semanticlab.net/eWRT.git
Dependencies
- python 2.6 or higher
- python-nose >=0.11
eWRT Package¶
Subpackages¶
access Package¶
db
Module¶
file
Module¶
Convenient methods for file access
@package eWRT.access.file Created on Dec 6, 2012
@author: albert
-
class
eWRT.access.file.
CompressedFile
(fname, mode='rb')[source]¶ Bases:
object
An intelligent file object that transparently opens compressed files.
-
COMPRESSION_EXT
= ('bz2', 'gz')¶
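The idea of transparently opening compressed files can be sketched with the standard library alone. This is a minimal illustration, not the implementation of CompressedFile: the helper name open_transparently and the OPENERS mapping are hypothetical, and only the two extensions listed in COMPRESSION_EXT are handled.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping from file extension to an opener function.
# Only the extensions named in COMPRESSION_EXT ('bz2', 'gz') are covered.
OPENERS = {'.bz2': bz2.open, '.gz': gzip.open}


def open_transparently(fname, mode='rb'):
    """Open fname, decompressing on the fly if its extension is known."""
    ext = os.path.splitext(fname)[1]
    opener = OPENERS.get(ext, open)  # fall back to the built-in open
    return opener(fname, mode)


# Usage: write a gzip file, then read it back like an ordinary file.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt.gz')
with gzip.open(path, 'wb') as f:
    f.write(b'hello eWRT')

with open_transparently(path) as f:
    content = f.read()
```

Choosing the opener from the file extension keeps the calling code identical for plain and compressed files.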
-
http
Module¶
config Package¶
config
Package¶
@package eWRT.config evaluates ~/.eWRT/siteconfig.py and publishes the values
input Package¶
Subpackages¶
conv Package¶
conv
Package¶
@package eWRT.input.conv
Classes for converting various input formats into each other.
doc
Module¶
@package eWRT.input.conv.doc converts Microsoft Word documents into text
html
Module¶
@package eWRT.input.conv.html converts HTML pages into text
pdf
Module¶
@package eWRT.input.conv.pdf converts PDF documents into text
lib Package¶
lib
Package¶
Result
Module¶
ResultSet
Module¶
Webservice
Module¶
apihelber
Module¶
Subpackages¶
thirdparty Package¶
advas
Package¶
Source code from the AdvaS Advanced Search project, version 0.2.3
phonetics
Module¶
-
eWRT.lib.thirdparty.advas.phonetics.
caverphone
(term)[source]¶ returns the phonetic key for the given term using the Caverphone 2.0 algorithm
-
eWRT.lib.thirdparty.advas.phonetics.
metaphone
(term)[source]¶ returns the Metaphone code for a given string
ontology Package¶
ontology
Package¶
@package eWRT.ontology.eval Evaluates ontologies based on their _internal_ characteristics such as term coherence, internal integrity, ...
For evaluations against a reference ontology @see eWRT.ontology.compare
stat Package¶
Subpackages¶
coherence Package¶
coherence
Package¶
@package eWRT.ws.stat.coherence Determines how strongly two terms are connected to each other
-
class
eWRT.stat.coherence.
Coherence
(dataSource, cache=True)[source]¶ Bases:
object
@class Coherence abstract class for computing the coherence between terms
-
class
eWRT.stat.coherence.
DiceCoherence
(dataSource, cache=True)[source]¶ Bases:
eWRT.stat.coherence.Coherence
@class DiceCoherence computes the dice coherence for the given terms
-
class
eWRT.stat.coherence.
PMICoherence
(dataSource, cache=True)[source]¶ Bases:
eWRT.stat.coherence.Coherence
@class PMICoherence computes the coherence based on the pointwise mutual information (PMI)
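The two coherence measures named above have standard textbook definitions, sketched here from raw occurrence counts. This is an illustration under assumptions, not the code of DiceCoherence or PMICoherence: the function names, the count-based parameters, and the choice of a base-2 logarithm are hypothetical, and the documentation does not say how the dataSource supplies the counts.

```python
import math


def dice_coherence(count_x, count_y, count_xy):
    """Dice coefficient: 2 * |X and Y| / (|X| + |Y|)."""
    total = count_x + count_y
    return 2.0 * count_xy / total if total else 0.0


def pmi_coherence(count_x, count_y, count_xy, n_documents):
    """Pointwise mutual information: log2(p(x, y) / (p(x) * p(y)))."""
    if not (count_x and count_y and count_xy):
        return float('-inf')  # the terms never co-occur
    n = float(n_documents)
    p_x, p_y, p_xy = count_x / n, count_y / n, count_xy / n
    return math.log(p_xy / (p_x * p_y), 2)


# Hypothetical counts: term x occurs in 100 documents, term y in 50,
# both together in 25, out of 1000 documents overall.
dice = dice_coherence(100, 50, 25)
pmi = pmi_coherence(100, 50, 25, 1000)
```

Dice stays within [0, 1], while PMI is unbounded and positive when the terms co-occur more often than independence would predict.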
eval Package¶
metrics
Module¶
@package eWRT.ws.stat.eval.metrics Standard IR evaluation metrics such as
- precision
- recall
- F1 measure
-
class
eWRT.stat.eval.metrics.
TestEvaluationMetrics
[source]¶ Bases:
object
tests the evaluation metrics
-
a
= set([8, 1, 3, 9])¶
-
b
= set([1, 10, 3, 12])¶
-
c
= set([1, 3])¶
-
d
= set([10, 11])¶
-
-
eWRT.stat.eval.metrics.
fMeasure
(p, r, beta=1.0)[source]¶ returns the F-measure for the given precision and recall
@param[in] p precision
@param[in] r recall
@param[in] beta weight used to compute the F-measure
@returns the F-measure
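The metrics listed above follow the standard information retrieval definitions, which the sketch below writes out. The function names precision, recall and f_measure are hypothetical and the library's own implementation may differ in details such as the handling of empty sets. The example treats the sets a and b from TestEvaluationMetrics as retrieved and relevant items, which is an assumption about their role.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved items that are relevant."""
    return len(retrieved & relevant) / float(len(retrieved)) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant items that were retrieved."""
    return len(retrieved & relevant) / float(len(relevant)) if relevant else 0.0


def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    denominator = beta ** 2 * p + r
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * p * r / denominator


# Assumption: a is the retrieved set, b the relevant set.
a = set([8, 1, 3, 9])
b = set([1, 10, 3, 12])

p = precision(a, b)   # 2 of 4 retrieved items are relevant: 0.5
r = recall(a, b)      # 2 of 4 relevant items were retrieved: 0.5
f1 = f_measure(p, r)  # 0.5
```

With beta greater than 1 the measure weights recall more heavily, and with beta less than 1 it favours precision.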
language Package¶
language
Package¶
@package eWRT.stat.language
language detection
-
class
eWRT.stat.language.
DetectLanguageTest
(methodName='runTest')[source]¶ Bases:
unittest.case.TestCase
-
eWRT.stat.language.
detect_language
(text)[source]¶ detects the most probable language for the given text
-
eWRT.stat.language.
get_lang_name
(fname)¶
util Package¶
advLogging
Module¶
-
class
eWRT.util.advLogging.
SNMPHandler
(moduleName)[source]¶ Bases:
logging.Handler
Logging handler for sending SNMP traps
assert
Module¶
@package eWRT.util.assert Assertion based counters
Examples: see unittests
async
Module¶
@package eWRT.util.async asynchronous procedure calls
@warning this library is still a draft and might change considerably
-
class
eWRT.util.async.
Async
(cache_dir, cache_nesting_level=0, cache_file_suffix='', max_processes=8, debug_dir=None)[source]¶ Bases:
eWRT.util.cache.DiskCache
Asynchronous Call Handling
-
getPostHashfile
(cmd)[source]¶ returns an identifier representing the object that is compatible with the identifiers returned by the eWRT.util.cache.* classes.
-
cache
Module¶
@package eWRT.util.cache caches arbitrary objects
-
class
eWRT.util.cache.
Cache
(fn=None)[source]¶ Bases:
object
An abstract class for caching functions
-
fetch
(fetch_function, *args, **kargs)[source]¶ Fetches an object from the cache or computes it by calling the fetch_function. The objectId is computed from the function arguments.
-
-
class
eWRT.util.cache.
DiskCache
(cache_dir, cache_nesting_level=0, cache_file_suffix='', fn=None)[source]¶ Bases:
eWRT.util.cache.Cache
@class DiskCache Caches arbitrary functions based on the function’s arguments (fetch) or on a user-defined key (fetchObjectId)
@remarks This version of DiskCached is threadsafe
-
fetch
(fetch_function, *args, **kargs)[source]¶ fetches the object with the given id, querying
- the cache and
- the fetch_function
If the fetch_function is called, the function’s result is saved in the cache.
:param fetch_function: function to call if the result is not in the cache
:param args: arguments
:param kargs: optional keyword arguments
:returns: the object (retrieved from the cache or computed)
-
fetchObjectId
(key, fetch_function, *args, **kargs)[source]¶ fetches the object with the given id, querying
- the cache and
- the fetch_function
If the fetch_function is called, the function’s result is saved in the cache.
:param key: key to fetch
:param fetch_function: function to call if the result is not in the cache
:param args: arguments
:param kargs: optional keyword arguments
:returns: the object (retrieved from the cache or computed)
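The fetch behaviour described above (look in the cache first, otherwise call the function and store its result) can be sketched as follows. This is a simplified illustration, not the DiskCache implementation: the standalone function fetch, the SHA-1 hash of the argument representation used as object id, and the pickle file layout are all assumptions. Unlike the documented class, the sketch is not thread-safe and ignores cache_nesting_level and cache_file_suffix.

```python
import hashlib
import os
import pickle
import tempfile


def fetch(cache_dir, fetch_function, *args, **kwargs):
    """Return the cached result for these arguments, computing it if needed."""
    # Assumption: the object id is a hash of the argument representation.
    key_source = repr((args, sorted(kwargs.items()))).encode('utf-8')
    object_id = hashlib.sha1(key_source).hexdigest()
    cache_file = os.path.join(cache_dir, object_id)

    # Cache hit: load the stored result from disk.
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            return pickle.load(f)

    # Cache miss: call the function and save its result.
    result = fetch_function(*args, **kwargs)
    with open(cache_file, 'wb') as f:
        pickle.dump(result, f)
    return result


# Usage: the second call is served from disk, so slow_square runs once.
calls = []


def slow_square(x):
    calls.append(x)
    return x * x


cache_dir = tempfile.mkdtemp()
first = fetch(cache_dir, slow_square, 7)
second = fetch(cache_dir, slow_square, 7)
```

Keying on the arguments corresponds to fetch, while a caller-supplied key in place of the hash would correspond to fetchObjectId.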
-
-
class
eWRT.util.cache.
DiskCached
(cache_dir, cache_nesting_level=0, cache_file_suffix='')[source]¶ Bases:
object
Decorator based on Cache for caching arbitrary function calls. Usage:
@DiskCached("./cache/myfunction")
def myfunction(*args): ...
@remarks This version of DiskCached is threadsafe
-
cache
¶
-
-
class
eWRT.util.cache.
IterableCache
(cache_dir, cache_nesting_level=0, cache_file_suffix='', fn=None)[source]¶ Bases:
eWRT.util.cache.DiskCache
caches arbitrary iterable content identified by an identifier
-
fetchObjectId
(key, function, *args, **kargs)[source]¶ fetches the object with the given id, querying
- the cache and
- the function
If the function is called, the function’s result is saved in the cache.
:param key: key to fetch
:param function: function to call if the result is not in the cache
:param args: arguments
:param kargs: optional keyword arguments
:returns: the object (retrieved from the cache or computed)
-
-
class
eWRT.util.cache.
MemoryCache
(max_cache_size=0, fn=None)[source]¶ Bases:
eWRT.util.cache.Cache
@class MemoryCached
Caches arbitrary functions based on the function’s arguments (fetch) or on a user-defined key (fetchObjectId)
-
max_cache_size
¶
-
-
class
eWRT.util.cache.
MemoryCached
(arg)[source]¶ Bases:
eWRT.util.cache.MemoryCache
Decorator based on MemoryCache for caching arbitrary function calls. Usage:
@MemoryCached or @MemoryCached(max_cache_size)
def myfunction(*args): ...
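A decorator of this kind can be sketched in a few lines. The code below is an illustration and not the MemoryCached class: the name memory_cached is hypothetical, only the parameterized form is supported, and two behaviours are assumptions that the documentation does not state, namely that a max_cache_size of 0 means unlimited and that the oldest entry is evicted first.

```python
import functools
from collections import OrderedDict


def memory_cached(max_cache_size=0):
    """Cache results in memory, keyed by the positional arguments."""
    def decorator(fn):
        cache = OrderedDict()

        @functools.wraps(fn)
        def wrapper(*args):
            if args in cache:
                return cache[args]
            result = fn(*args)
            cache[args] = result
            # Assumption: 0 means unlimited; otherwise evict the oldest entry.
            if max_cache_size and len(cache) > max_cache_size:
                cache.popitem(last=False)
            return result
        return wrapper
    return decorator


# Usage: with room for two entries, the result for 1 is evicted
# before the final call and has to be recomputed.
calls = []


@memory_cached(max_cache_size=2)
def double(x):
    calls.append(x)
    return 2 * x


results = [double(1), double(1), double(2), double(3), double(1)]
```

Because the arguments serve as dictionary keys, this approach only works for hashable arguments.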
-
class
eWRT.util.cache.
RedisCache
(max_cache_size=0, fn=None, host='localhost', port=6379, db=0)[source]¶ Bases:
eWRT.util.cache.Cache
-
class
eWRT.util.cache.
RedisCached
(arg)[source]¶ Bases:
eWRT.util.cache.RedisCache
Decorator based on RedisCache for caching arbitrary function calls. Usage:
@RedisCached or @RedisCached(max_cache_size)
def myfunction(*args): ...
-
eWRT.util.cache.
get_unique_temp_file
(fname)¶
exception
Module¶
@package eWRT.util.exception
-
exception
eWRT.util.exception.
SNMPException
(module_name, msg, level='warning')[source]¶ Bases:
exceptions.Exception
reports an exception to SNMP
execute
Module¶
@package eWRT.util.execute Helpers for executing third party modules
monitoring
Module¶
The NSCA-Service class sends test results to Nagios over the NSCA service. This is a more reliable way of sending messages directly from programs than SNMP: the messages are not tied to a single service, and the configuration is easier.
== Configuration ==
- Install the package nsca
aptitude install nsca
- On the monitoring system:
** edit the file /etc/nsca.cfg
** set a password and an appropriate encryption method
- On the host:
** edit the file /etc/send_nsca.cfg
** enter the above password and encryption method
pickleIterator
Module¶
pickleIterator
-
class
eWRT.util.pickleIterator.
AbstractIterator
(fname, file_mode=None)[source]¶ Bases:
object
Abstract Iterator class used to implement ReadPickleIterator and WritePickleIterator
-
class
eWRT.util.pickleIterator.
ReadPickleIterator
(fname)[source]¶ Bases:
eWRT.util.pickleIterator.AbstractIterator
provides an iterator over pickled elements
-
class
eWRT.util.pickleIterator.
WritePickleIterator
(fname)[source]¶ Bases:
eWRT.util.pickleIterator.AbstractIterator
writes pickled elements (available as iterator) to a file
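Writing elements one by one and iterating over them again can be done with the standard pickle module, as sketched below. The functions write_pickled and read_pickled are hypothetical stand-ins for illustration. They are not the WritePickleIterator and ReadPickleIterator classes, whose file format and interface may differ.

```python
import os
import pickle
import tempfile


def write_pickled(fname, elements):
    """Append each element of an iterable to fname as a separate pickle."""
    with open(fname, 'wb') as f:
        for element in elements:
            pickle.dump(element, f)


def read_pickled(fname):
    """Yield the pickled elements of fname one at a time."""
    with open(fname, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:  # raised once the end of the file is reached
                return


# Usage: the elements come back in the order in which they were written.
path = os.path.join(tempfile.mkdtemp(), 'items.pickle')
write_pickled(path, [{'id': 1}, 'text', (2, 3)])
restored = list(read_pickled(path))
```

Reading lazily with a generator means that only one element needs to be held in memory at a time.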
profile
Module¶
@package eWRT.util.profile google like profiling :)
@warning this library is still a draft and might change considerably
ws Package¶
ws
Package¶
@package eWRT.ws Web Service access.
-
class
eWRT.ws.
AbstractIterableWebSource
[source]¶ Bases:
eWRT.ws.AbstractWebSource
Web source that issues several calls to the API, iterating over the search terms and respecting the API-specific limit on the maximum number of results
-
DEFAULT_COMMAND
= None¶
-
DEFAULT_FORMAT
= None¶
-
DEFAULT_MAX_RESULTS
= None¶
-
DEFAULT_START_INDEX
= None¶
-
RESULT_PATH
= None¶
-
invoke_iterator
(search_terms, max_results, from_date=None, to_date=None, command=None, output_format=None)[source]¶ iterator: iterates over search terms and API requests
-
process_output
(results, path)[source]¶ results’ post-processor: iterates over the API responses and calls the output converter
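The iteration pattern described for this class, one loop over the search terms and an inner loop over paged API requests, might look like the sketch below. It is not the code of AbstractIterableWebSource: the fetch_page callback, the page_size parameter and the fake data source are hypothetical, and treating max_results as a limit per search term is an assumption.

```python
def invoke_iterator(search_terms, max_results, fetch_page, page_size=10):
    """Yield (term, item) pairs, requesting results page by page."""
    for term in search_terms:
        start = 0
        while start < max_results:
            # Never request more than the API allows per call.
            count = min(page_size, max_results - start)
            page = fetch_page(term, start, count)
            if not page:
                break  # the source has no further results for this term
            for item in page:
                yield term, item
            start += len(page)


def fake_fetch_page(term, start, count):
    """Hypothetical stand-in for an API call: 25 results per term."""
    data = ['%s-%d' % (term, i) for i in range(25)]
    return data[start:start + count]


# Usage: 12 results per term, fetched in pages of at most 5 items.
results = list(invoke_iterator(['python', 'web'], 12, fake_fetch_page,
                               page_size=5))
```

Writing the iterator as a generator lets callers stop early without triggering the remaining API requests.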
-
TagInfoService
Module¶
-
class
eWRT.ws.TagInfoService.
TagInfoService
[source]¶ Bases:
object
Class for fetching assigned tags
WebDataSource
Module¶
Features¶
eWRT provides the following features:
- content acquisition components for the Web (see eWRT.access.http) and different social media sources (see eWRT.ws)
- low-level natural language processing, e.g.
  - language detection (eWRT.stat.language)
- content caching (see eWRT.util.cache)