WA-KAT documentation

WA-KAT is a project that simplifies the work of curators at the Webarchive of the National Library of the Czech Republic by cataloging electronic resources through semi-automatic analysis.

The project is written in Python as a single-page bottle.py application.

Component documentation

This is the programmer documentation for all components.

wa_kat:

logger

data_model

settings

This module contains all the global variables needed by the package.

The module can also read user-defined settings from the following paths:

  • The .json file pointed to by the SETTINGS_PATH environment variable.
  • $HOME/_SETTINGS_PATH
  • /etc/_SETTINGS_PATH

See _SETTINGS_PATH for details.

Note

Once the first path is found, the others are ignored.

Example of the configuration file ($HOME/webarchive/wa_kat.json):

{
    "WEB_ADDR": "somedomain.cz",
    "WEB_PORT": 80
}

Example of starting the program with env variable:

export WA_KAT_SETTINGS="/tmp/conf.json"; bin/wa_kat_server.py
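The lookup order described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper names `candidate_paths` and `load_overrides` are hypothetical, and the sketch assumes that the environment variable is called WA_KAT_SETTINGS and that _SETTINGS_PATH is webarchive/wa_kat.json, as the two examples above suggest.

```python
import json
import os


def candidate_paths(environ, env_var="WA_KAT_SETTINGS",
                    rel_path="webarchive/wa_kat.json"):
    """List candidate settings files in the documented priority order.

    `env_var` and `rel_path` are assumptions taken from the examples.
    """
    paths = []
    if environ.get(env_var):
        paths.append(environ[env_var])
    if environ.get("HOME"):
        paths.append(os.path.join(environ["HOME"], rel_path))
    paths.append(os.path.join("/etc", rel_path))
    return paths


def load_overrides(defaults, environ=None):
    """Return a copy of `defaults` updated from the first settings file found."""
    environ = os.environ if environ is None else environ
    settings = dict(defaults)
    for path in candidate_paths(environ):
        if os.path.isfile(path):
            with open(path) as f:
                settings.update(json.load(f))
            break  # the first file found wins; later paths are ignored
    return settings
```

Keys missing from the user file keep their default values, so a configuration file only needs to list the settings it changes.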

Attributes

wa_kat.settings.DB_CACHE_TIME = 1800

ZEO cache time - 30 minutes.

wa_kat.settings.DB_MAX_WAIT_TIME = 300

Time after which the processing is restarted - 5 minutes.

wa_kat.settings.WEB_ADDR = '0.0.0.0'

Address where the webserver should listen.

wa_kat.settings.WEB_PORT = 8080

Port for the webserver.

wa_kat.settings.WEB_SERVER = 'paste'

Use paste for threading.

wa_kat.settings.WEB_DEBUG = False

Turn on web debug messages?

wa_kat.settings.WEB_RELOADER = False

Turn on reloader for webserver?

wa_kat.settings.WEB_BE_QUIET = False

Be quiet and don’t emit debug messages to terminal.
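As an illustration of how the WEB_* attributes could be passed to bottle, the sketch below maps them onto the keyword arguments of `bottle.run()`. The mapping and the helper names `bottle_run_kwargs` and `serve` are assumptions made for this example; the project's own startup script may wire the settings differently.

```python
def bottle_run_kwargs(settings):
    """Translate WEB_* settings into keyword arguments for bottle.run().

    The mapping is an assumption based on the attribute descriptions.
    """
    return {
        "server": settings["WEB_SERVER"],      # e.g. 'paste' for threading
        "host": settings["WEB_ADDR"],
        "port": settings["WEB_PORT"],
        "debug": settings["WEB_DEBUG"],
        "reloader": settings["WEB_RELOADER"],
        "quiet": settings["WEB_BE_QUIET"],
    }


def serve(settings):
    """Start the web server; requires bottle (and paste for server='paste')."""
    import bottle  # imported lazily so the mapping can be used without bottle
    bottle.run(**bottle_run_kwargs(settings))
```

Calling `serve()` with the default values listed above would listen on 0.0.0.0:8080 using the paste server.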

wa_kat.settings.SEEDER_INFO_URL = 'http://seeder.visgean.me/api/source/%s/'

URL template for the Seeder API.

wa_kat.settings.SEEDER_TOKEN = ''

Token for the Seeder API

wa_kat.settings.SEEDER_TIMEOUT = 5

How long to wait for the Seeder.

wa_kat.settings.TIMEOUT_MESSAGE = 'Po\xc5\xbeadovanou str\xc3\xa1nku nebylo mo\xc5\xben\xc3\xa9 st\xc3\xa1hnout. Zkuste zadat URL s www.'

Error message shown when the analysis times out. The Czech text means: "The requested page could not be downloaded. Try entering the URL with www."
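The value of TIMEOUT_MESSAGE is displayed above as an escaped UTF-8 byte string. A short sketch (Python 3 syntax) shows how the escapes decode into readable Czech text:

```python
# Value exactly as shown in the settings listing (a UTF-8 byte string).
raw = (b'Po\xc5\xbeadovanou str\xc3\xa1nku nebylo mo\xc5\xben\xc3\xa9 '
       b'st\xc3\xa1hnout. Zkuste zadat URL s www.')

# Decoding the bytes as UTF-8 yields the human-readable message.
message = raw.decode("utf-8")
print(message)
```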

wa_kat.settings.REQUEST_TIMEOUT = 5

How long to wait until the analysed page loads.

wa_kat.settings.GUI_TO_REST_PERIODE = 2

How often the GUI checks the REST API.

wa_kat.settings.WHOIS_URL = 'http://whois.icann.org/en/lookup?name=%s'

Whois URL
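Both WHOIS_URL and SEEDER_INFO_URL contain a `%s` placeholder. The sketch below shows one way such a template could be filled in. The helper `fill_url` is hypothetical, and the meaning of each placeholder (a domain name for Whois, a source identifier for the Seeder) is an assumption based on the URL shapes.

```python
from urllib.parse import quote

WHOIS_URL = 'http://whois.icann.org/en/lookup?name=%s'
SEEDER_INFO_URL = 'http://seeder.visgean.me/api/source/%s/'


def fill_url(template, value):
    """Insert `value` into a '%s' URL template, percent-encoding it first."""
    return template % quote(str(value), safe="")


# Assumed usage: a domain name for Whois, an identifier for the Seeder.
print(fill_url(WHOIS_URL, "webarchiv.cz"))
print(fill_url(SEEDER_INFO_URL, 42))
```

Percent-encoding the value keeps characters such as spaces or slashes from changing the structure of the resulting URL.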

wa_kat.settings.NTK_ALEPH_URL = 'http://aleph.techlib.cz/X'

URL to the NTK aleph.

wa_kat.settings.USER_AGENT = 'http://webarchiv.cz catalogization tool WA-KAT.'

User agent used in analysis.

wa_kat.settings.API_PATH = '/api_v1/'

Path for the REST API.

wa_kat.settings.ERROR_LOG_PATH = '/tmp/wa-kat.log'

Path to the local logfile.

wa_kat.settings.LOG_UDP_ADDR = 'kitakitsune.org'

Address of the log server.

wa_kat.settings.LOG_UDP_PORT = 32000

Port for logging.

wa_kat.settings.SENTRY_DSN = ''

Sentry DSN string. You may find this in Sentry.

wa_kat.settings.LOG_TO_FILE = True

Should logs go into the ERROR_LOG_PATH?

wa_kat.settings.LOG_VIA_UDP = True

Should logs go to the log server?

wa_kat.settings.LOG_TO_STDOUT = False

Should logs go to the stdout?

wa_kat.settings.LOG_TO_SENTRY = True

Should logs go to Sentry?
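The LOG_* switches can be read as a list of enabled log destinations. The following sketch builds standard-library logging handlers from them. It is only an illustration: the helper `build_handlers` is hypothetical, the project's logger module may be implemented differently, and Sentry is left out because it needs a third-party client.

```python
import logging
import logging.handlers
import sys


def build_handlers(settings):
    """Create one logging handler per enabled LOG_* destination."""
    handlers = []
    if settings.get("LOG_TO_FILE"):
        # delay=True postpones opening the file until the first record
        handlers.append(
            logging.FileHandler(settings["ERROR_LOG_PATH"], delay=True)
        )
    if settings.get("LOG_VIA_UDP"):
        handlers.append(logging.handlers.DatagramHandler(
            settings["LOG_UDP_ADDR"], settings["LOG_UDP_PORT"]
        ))
    if settings.get("LOG_TO_STDOUT"):
        handlers.append(logging.StreamHandler(sys.stdout))
    return handlers


def configure_logger(name, settings):
    """Attach the enabled handlers to a named logger and return it."""
    logger = logging.getLogger(name)
    for handler in build_handlers(settings):
        logger.addHandler(handler)
    return logger
```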

analyzers subpackage:

author_detector

annotation_detector

creation_date_detector

keyword_detector

language_detector

place_detector

title_detector

source_string container

shared submodule

connectors subpackage:

aleph connector

seeder connector

convertors subpackage:

mrc convertor

Dublin core convertor

iso_codes convertor

REST API subpackage:

bottle_index

keywords API

REST Database

Analyzers API

Aleph API

virtual_fs API

to_output API

shared submodule

DB subpackage:

request_info

downloader

wa_kat.db.downloader.download(url)

Download url and return its content as UTF-8 encoded text.

Parameters: url (str) – URL to download.
Returns: Content of the page.
Return type: str
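A function with this interface could be sketched with the standard library as shown below. This is not the project's implementation. It assumes that the USER_AGENT and REQUEST_TIMEOUT settings apply to the download and that the timeout is given in seconds. It is written for Python 3, where the result is a decoded str rather than a UTF-8 byte string. The helper `build_request` is hypothetical.

```python
import urllib.request

# Values copied from the settings listing; their use here is an assumption.
USER_AGENT = 'http://webarchiv.cz catalogization tool WA-KAT.'
REQUEST_TIMEOUT = 5  # assumed to be seconds


def build_request(url):
    """Create a request that identifies itself with the WA-KAT user agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def download(url):
    """Download `url` and return the page content as text."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=REQUEST_TIMEOUT) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

If the page does not load in time, `urlopen` raises an exception, which a caller could catch and answer with TIMEOUT_MESSAGE.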

worker

Source code

This project is released as open source under the MIT license, and the source code can be found on GitHub.
