edeposit.amqp.storage

Long term storage subsystem for the E-deposit project.

This project allows publications to be stored and retrieved over AMQP. Optionally, accessible publications can also be served over HTTP by a built-in webserver written in bottle.py.

Package structure

File relations

_images/relations.png

API

Storage:

archive_storage wrapper

This module provides a frontend API for storing and retrieving DBArchive objects in the universal object database.

storage.archive_storage.save_archive(archive)

Save archive into database and into proper indexes.

Attr:
archive (obj): Instance of the DBArchive.
Returns:

DBArchive without data.

Return type:

obj

Raises:
  • InvalidType – When the archive is not instance of DBArchive.
  • UnindexablePublication – When there is no index (property) which can be used to index archive in database.
storage.archive_storage.search_archives(query)

Return list of DBArchive which match all properties that are set (not None) using AND operator to all of them.

Example

result = archive_storage.search_archives(
    DBArchive(isbn="azgabash")
)

Parameters:query (obj) – DBArchive with some of the properties set.
Returns:List of matching DBArchive or [] if no match was found.
Return type:list
Raises:InvalidType – When the query is not instance of DBArchive.

publication_storage wrapper

This module provides a frontend API for storing and retrieving DBPublication objects in the universal object database.

storage.publication_storage.save_publication(pub)

Save pub into database and into proper indexes.

Attr:
pub (obj): Instance of the DBPublication.
Returns:

DBPublication without data.

Return type:

obj

Raises:
  • InvalidType – When the pub is not instance of DBPublication.
  • UnindexablePublication – When there is no index (property) which can be used to index pub in database.
storage.publication_storage.search_pubs_by_uuid(uuid)

Search publications by uuid.

Parameters:uuid (str) – UUID of publication.
Returns:List of matching DBPublication or [] if no match was found.
Return type:list
storage.publication_storage.search_publications(query)

Return list of DBPublication which match all properties that are set (not None) using AND operator to all of them.

Example

result = storage_handler.search_publications(
    DBPublication(isbn="azgabash")
)

Parameters:query (obj) – DBPublication with some of the properties set.
Returns:List of matching DBPublication or [] if no match was found.
Return type:list
Raises:InvalidType – When the query is not instance of DBPublication.

tree_handler module

This module provides database for Tree instances.

class storage.tree_handler.TreeHandler(conf_path='/home/docs/checkouts/readthedocs.org/user_builds/edeposit-amqp-storage/checkouts/stable/src/edeposit/amqp/storage/zconf/zeo_client.conf', project_key='tree_storage')

This class is used as database handler for Tree instances.

name_db_key

str – Key for the name_db.

name_db

dict – Database handler dict for name.

aleph_id_db_key

str – Key for the aleph_id_db.

aleph_id_db

dict – Database handler dict for aleph_id.

issn_db_key

str – Key for the issn_db.

issn_db

dict – Database handler dict for issn.

path_db_key

str – Key for the path_db.

path_db

dict – Database handler dict for path.

parent_db_key

str – Key for the parent_db.

parent_db

dict – Database handler dict for parent.

Constructor.

Parameters:
add_tree(*args, **kwargs)

Add tree into database.

Parameters:
  • tree (obj) – Tree instance.
  • parent (ref, default None) – Reference to parent tree. This is used for all sub-trees in recursive call.
remove_tree_by_path(path)

Remove the tree from database by given path.

Parameters:path (str) – Path of the tree.
remove_tree(tree)

Remove the tree from the database, using the tree object to identify the path.

Parameters:tree (obj) – Tree instance.
trees_by_issn(*args, **kwargs)

Search trees by issn.

Parameters:issn (str) – Tree.issn property of Tree.
Returns:Set of matching Tree instances.
Return type:set
trees_by_path(*args, **kwargs)

Search trees by path.

Parameters:path (str) – Tree.path property of Tree.
Returns:Set of matching Tree instances.
Return type:set
trees_by_subpath(*args, **kwargs)

Search trees by sub_path using Tree.path.startswith(sub_path) comparison.

Parameters:sub_path (str) – Part of the Tree.path property of Tree.
Returns:Set of matching Tree instances.
Return type:set
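
The prefix comparison described for trees_by_subpath can be illustrated with a minimal in-memory sketch. It is not the package's implementation: FakeTree and the path_db dictionary below are hypothetical stand-ins, and the real handler keeps its indexes in ZODB.

```python
from collections import namedtuple

# Hypothetical stand-in for Tree; only the fields used in this sketch.
FakeTree = namedtuple("FakeTree", ["name", "path"])


def trees_by_subpath(path_db, sub_path):
    """Return the set of trees whose path starts with `sub_path`."""
    matches = set()
    for path, trees in path_db.items():
        if path.startswith(sub_path):
            matches.update(trees)
    return matches


# Toy path index (assumed layout): path -> set of trees stored under it.
path_db = {
    "/periodical_a/2015": {FakeTree("2015", "/periodical_a/2015")},
    "/periodical_a/2016": {FakeTree("2016", "/periodical_a/2016")},
    "/periodical_b/2015": {FakeTree("2015", "/periodical_b/2015")},
}

found = trees_by_subpath(path_db, "/periodical_a")
```

As in the documented methods, the result is a set, so a query without any match yields an empty set rather than None.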
get_parent(*args, **kwargs)

Get parent for given tree or alt if not found.

Parameters:
  • tree (obj) – Tree instance, which is already stored in DB.
  • alt (obj, default None) – Alternative value returned when tree is not found.
Returns:

Tree parent to given tree.

Return type:

obj

storage.tree_handler.tree_handler(*args, **kwargs)

Singleton TreeHandler generator. Any arguments are given to TreeHandler, when it is first created.

Returns:TreeHandler instance.
Return type:obj
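
The "created once, arguments used only the first time" behaviour can be sketched as follows. FakeHandler and fake_tree_handler are hypothetical names used for illustration only; the sketch assumes a module-level cache, which may differ from how the package stores its instance.

```python
class FakeHandler(object):
    """Hypothetical stand-in for TreeHandler."""

    def __init__(self, conf_path="zeo_client.conf", project_key="tree_storage"):
        self.conf_path = conf_path
        self.project_key = project_key


_handler = None  # module-level cache holding the single instance


def fake_tree_handler(*args, **kwargs):
    """Create FakeHandler on the first call, then return the same object."""
    global _handler
    if _handler is None:
        _handler = FakeHandler(*args, **kwargs)
    return _handler


first = fake_tree_handler(project_key="custom_key")
# The instance already exists, so these arguments have no effect.
second = fake_tree_handler(project_key="ignored")
```

A consequence of this design is that any non-default configuration has to be passed by the very first caller.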

storage_handler module

This module handles the database, maintains indexes and provides a search function over these indexes.

exception storage.storage_handler.InvalidType

Raised when the object you are trying to store doesn’t have the required interface.

exception storage.storage_handler.UnindexableObject

Raised when the object doesn’t have at least one attribute set.

class storage.storage_handler.StorageHandler(project_key, conf_path='/home/docs/checkouts/readthedocs.org/user_builds/edeposit-amqp-storage/checkouts/stable/src/edeposit/amqp/storage/zconf/zeo_client.conf')

Object database with indexing by the object attributes.

Each stored object is required to have following properties:

  • indexes (list of strings)
  • project_key (string)

For example:

class Person(Persistent):
    def __init__(self, name, surname):
        self.name = name
        self.surname = surname

    @property
    def indexes(self):
        return [
            "name",
            "surname",
        ]

    @property
    def project_key(self):
        return PROJECT_KEY

Note

I suggest using properties, because that way the values are not stored in the database, but constructed on request by the property methods.

Constructor.

Parameters:
  • project_key (str) – Project key which is used for the root of DB.
  • conf_path (str) – Path to the client zeo configuration file. Default settings.ZEO_CLIENT_PATH.
store_object(obj)

Save obj into database and into proper indexes.

Attr:
obj (obj): Indexable object.
Raises:
  • InvalidType – When the obj doesn’t have the right properties.
  • UnindexableObject – When there are no indexes defined.
search_objects(query)

Return list of objects which match all properties that are set (not None) using AND operator to all of them.

Example

result = storage_handler.search_objects(
    DBPublication(isbn="azgabash")
)

Parameters:query (obj) – Object implementing proper interface with some of the properties set.
Returns:List of matching objects or [] if no match was found.
Return type:list
Raises:InvalidType – When the query doesn’t implement required properties.
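
The idea of indexing objects by their attributes and combining the set properties with AND can be sketched in memory. MemoryStore is a hypothetical stand-in, not the StorageHandler implementation: it uses plain dictionaries instead of ZODB, omits the InvalidType checks, and its handling of a query with no property set (an empty list) is an assumption.

```python
from collections import defaultdict


class Person(object):
    """Plain (non-persistent) variant of the Person example above."""

    def __init__(self, name=None, surname=None):
        self.name = name
        self.surname = surname

    @property
    def indexes(self):
        return ["name", "surname"]


class MemoryStore(object):
    """Hypothetical in-memory stand-in for an attribute-indexed store."""

    def __init__(self):
        # attribute name -> attribute value -> set of stored objects
        self._indexes = defaultdict(lambda: defaultdict(set))

    def store_object(self, obj):
        for attr in obj.indexes:
            value = getattr(obj, attr)
            if value is not None:
                self._indexes[attr][value].add(obj)

    def search_objects(self, query):
        result = None
        for attr in query.indexes:
            value = getattr(query, attr)
            if value is None:
                continue  # unset properties do not restrict the search
            matches = self._indexes[attr].get(value, set())
            # AND semantics: intersect with what matched so far.
            result = matches if result is None else result & matches
        return list(result or [])


store = MemoryStore()
alice = Person("Alice", "Smith")
bob = Person("Bob", "Smith")
store.store_object(alice)
store.store_object(bob)

smiths = store.search_objects(Person(surname="Smith"))
alice_only = store.search_objects(Person(name="Alice", surname="Smith"))
nobody = store.search_objects(Person(name="Carol"))
```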

web_tools submodule

Functions shared by the server script and also by the backend.

exception storage.web_tools.PrivatePublicationError

Bases: exceptions.UserWarning

Indication that publication is private.

storage.web_tools.compose_path(pub, uuid_url=False)

Compose absolute path for given pub.

Parameters:
  • pub (obj) – DBPublication instance.
  • uuid_url (bool, default False) – Compose URL using UUID.
Returns:

Absolute url-path of the publication, without server’s address and protocol.

Return type:

str

Raises:

PrivatePublicationError – When the pub is private publication.

storage.web_tools.compose_tree_path(tree, issn=False)

Compose absolute path for given tree.

Parameters:
  • tree (obj) – Tree instance.
  • issn (bool, default False) – Compose URL using ISSN.
Returns:

Absolute path of the tree, without server’s address and protocol.

Return type:

str

storage.web_tools.compose_full_url(pub, uuid_url=False)

Compose full url for given pub, with protocol, server’s address and port.

Parameters:
  • pub (obj) – DBPublication instance.
  • uuid_url (bool, default False) – Compose URL using UUID.
Returns:

Absolute url of the publication.

Return type:

str

Raises:

PrivatePublicationError – When the pub is private publication.

storage.web_tools.compose_tree_url(tree, issn_url=False)

Compose full url for given tree, with protocol, server’s address and port.

Parameters:
  • tree (obj) – Tree instance.
  • issn_url (bool, default False) – Compose URL using ISSN.
Returns:

URL of the tree

Return type:

str

settings submodule

This module contains all necessary global variables for the package.

The module can also read user-defined data from the following paths:

  • SETTINGS_PATH env variable, pointing to a .json file.
  • $HOME/_SETTINGS_PATH
  • /etc/_SETTINGS_PATH

See _SETTINGS_PATH for details.

Note

When the first path is found, the others are ignored.
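
The lookup order can be sketched as below. This is an illustration under stated assumptions, not the code of the settings module: the helper names candidate_paths and load_settings are hypothetical, the environment variable name is taken from the list above, and restricting overrides to known uppercase keys is an assumption.

```python
import json
import os

_SETTINGS_PATH = "edeposit/storage.json"


def candidate_paths(environ=os.environ):
    """Yield configuration paths in the order listed above."""
    env_path = environ.get("SETTINGS_PATH")
    if env_path:
        yield env_path
    yield os.path.join(environ.get("HOME", ""), _SETTINGS_PATH)
    yield os.path.join("/etc", _SETTINGS_PATH)


def load_settings(defaults, paths):
    """Return a copy of `defaults` updated from the first existing file."""
    settings = dict(defaults)
    for path in paths:
        if os.path.exists(path):
            with open(path) as f:
                user_conf = json.load(f)
            # Assumption: only known uppercase keys may be overridden.
            for key, value in user_conf.items():
                if key.isupper() and key in settings:
                    settings[key] = value
            break  # the first file found wins, the rest are ignored
    return settings
```

Calling load_settings({"WEB_PORT": 8080}, candidate_paths()) would return the defaults unchanged when none of the candidate files exists.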

Example of the configuration file ($HOME/edeposit/storage.json):

{
    "PRIVATE_INDEX_USERNAME": "username",
    "PRIVATE_INDEX_PASSWORD": "password"
}

Example of starting the program with env variable:

export WA_KAT_SETTINGS="/tmp/conf.json"; bin/edeposit_storage_server.py
Attributes
storage.settings.ZCONF_PATH = '/home/docs/checkouts/readthedocs.org/user_builds/edeposit-amqp-storage/checkouts/stable/src/edeposit/amqp/storage/zconf'

Path to the directory with zeo.conf and zeo_client.conf.

storage.settings.ZEO_SERVER_PATH = '/home/docs/checkouts/readthedocs.org/user_builds/edeposit-amqp-storage/checkouts/stable/src/edeposit/amqp/storage/zconf/zeo.conf'
storage.settings.ZEO_CLIENT_PATH = '/home/docs/checkouts/readthedocs.org/user_builds/edeposit-amqp-storage/checkouts/stable/src/edeposit/amqp/storage/zconf/zeo_client.conf'
storage.settings.PUB_PROJECT_KEY = 'pub_storage'

This is used in ZODB. DON’T CHANGE THIS.

storage.settings.ARCH_PROJECT_KEY = 'archive_storage'

This is used in ZODB. DON’T CHANGE THIS.

storage.settings.TREE_PROJECT_KEY = 'tree_storage'

This is used in ZODB. DON’T CHANGE THIS.

storage.settings.PRIVATE_INDEX = False

Should the index be private?

storage.settings.PRIVATE_INDEX_USERNAME = 'edeposit'

Username for private index.

storage.settings.PRIVATE_INDEX_PASSWORD = ''

Password for private index. You HAVE TO set it.

storage.settings.PUBLIC_DIR = ''

Path to the directory for public publications.

storage.settings.PRIVATE_DIR = ''

Path to the private directory, for non-downloadable pubs.

storage.settings.ARCHIVE_DIR = ''

Path to the directory, where the archives will be stored.

storage.settings.HNAS_INDICATOR = ''

Path to the file saved on HNAS, which is used to indicate that HNAS is mounted.

storage.settings.HNAS_IND_ALLOWED = False

Should the HNAS indicator be used or not?

storage.settings.WEB_ADDR = 'localhost'

Address where the webserver should listen.

storage.settings.WEB_PORT = 8080

Port for the webserver.

storage.settings.WEB_SERVER = 'wsgiref'

Use paste for threading.

storage.settings.WEB_DB_TIMEOUT = 30

How often should web refresh connection to DB.

storage.settings.DOWNLOAD_KEY = 'download'

Used as part of the url. Don’t change this later.

storage.settings.UUID_DOWNLOAD_KEY = 'UUID'

Used as part of the url. Don’t change this.

storage.settings.PATH_DOWNLOAD_KEY = 'tree_by_path'

Key used for URL composition for trees.

storage.settings.ISSN_DOWNLOAD_KEY = 'tree_by_issn'

Key used for URL composition for trees.

storage.settings._SETTINGS_PATH = 'edeposit/storage.json'

Appended to default search paths.

storage.settings._ALLOWED = [<type 'str'>, <type 'unicode'>, <type 'int'>, <type 'float'>, <type 'long'>, <type 'bool'>]

Allowed types.

Structures

AMQP:

responses submodule

Structures used for AMQP responses.

class storage.structures.comm.responses.SearchResult

Response to SearchRequest.

records

list – List of matching Publication objects.

Create new instance of SearchResult(records,)

class storage.structures.comm.responses.TreeInfo

Information about stored trees.

path

str – Path of the tree in storage.

url_by_path

str – Full url-encoded path of the tree in storage.

url_by_issn

str – URL composed from ISSN.

Create new instance of TreeInfo(path, url_by_path, url_by_issn)

requests submodule

Structures used for AMQP communication requests.

class storage.structures.comm.requests.SearchRequest(query, light_request=False)

Retrieve publication from archive using query, an instance of Publication or Archive. Any property of the query is used to retrieve data.

query

obj – Instance of Publication or Archive.

light_request

bool, default False – If true, don’t return the data. This is used when you need just the metadata info.

class storage.structures.comm.requests.SaveRequest(record)

Save record to the storage.

record

obj – Instance of Publication or Archive.

Create new instance of SaveRequest(record,)
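
The "Create new instance of ..." lines suggest that these structures are namedtuples. A minimal sketch of how such request tuples, including a default for light_request, could be declared is shown below. It only mirrors the documented signatures; the dictionary used as a query is a placeholder for a Publication or Archive instance.

```python
from collections import namedtuple


class SearchRequest(namedtuple("SearchRequest", ["query", "light_request"])):
    """Sketch of a request tuple with a default for `light_request`."""

    def __new__(cls, query, light_request=False):
        return super(SearchRequest, cls).__new__(cls, query, light_request)


SaveRequest = namedtuple("SaveRequest", ["record"])
SearchResult = namedtuple("SearchResult", ["records"])

# Placeholder query; the documented API expects Publication or Archive.
request = SearchRequest(query={"isbn": "azgabash"})

# Namedtuples are immutable, so a modified copy is made with _replace().
light = request._replace(light_request=True)
```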

Publication structure

Communication structure used by AMQP.

class storage.structures.comm.publication.Publication(*args, **kwargs)

Bases: storage.structures.comm.publication.Publication

Communication structure used to send data to storage subsystem over AMQP.

title

str – Title of the publication.

author

str – Name of the author.

pub_year

str – Year when the publication was released.

isbn

str – ISBN for the publication.

urnnbn

str – URN:NBN for the publication.

uuid

str – UUID string to pair the publication with edeposit.

aleph_id

str – ID used in aleph.

producent_id

str – ID used for producent.

is_public

bool – Is the file public?

filename

str – Original filename.

is_periodical

bool – Is the publication periodical?

path

str – Path in the tree (used for periodicals).

b64_data

str – Base64-encoded data of the ebook file.

url

str – URL in case that publication is public.

file_pointer

str – Pointer to the file on the file server.
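
A value for a field like b64_data can be produced with the standard base64 module. The helper names below are hypothetical and the byte string is made-up sample content; the sketch only shows the encoding round trip, not how the package fills the field.

```python
import base64


def encode_file_data(raw_bytes):
    """Return base64 text suitable for a `b64_data`-style field."""
    return base64.b64encode(raw_bytes).decode("ascii")


def decode_file_data(b64_data):
    """Inverse of encode_file_data()."""
    return base64.b64decode(b64_data)


ebook = b"%PDF-1.4 example content"  # made-up sample bytes
b64_data = encode_file_data(ebook)
```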

Archive structure

Communication structure used by AMQP.

class storage.structures.comm.archive.Archive(*args, **kwargs)

Bases: storage.structures.comm.archive.Archive

Communication structure used to send data to storage subsystem over AMQP.

isbn

str – ISBN for the archive.

uuid

str – UUID string to pair the archive with edeposit.

aleph_id

str – ID used in aleph.

b64_data

str – Base64-encoded data of the ebook file.

dir_pointer

str – Pointer to the directory on the file server.

Tree structure

Communication structure used by AMQP.

class storage.structures.comm.tree.Tree(*args, **kwargs)

Bases: storage.structures.comm.tree.Tree

Communication structure used to send data to storage subsystem over AMQP.

name

str – Name of the periodical.

sub_trees

list – List of other trees.

sub_publications

list – List of sub-publication UUIDs.

aleph_id

str – ID used in aleph.

issn

str – ISSN given to the periodical.

is_public

bool – Is the tree public?

path

str, default “” – Path in the periodical structures.

Constructor.

Parameters:
  • name (str) – Name of the periodical.
  • sub_trees (list) – List of other trees.
  • sub_publications (list) – List of sub-publication UUIDs.
  • aleph_id (str) – ID used in aleph.
  • issn (str) – ISSN given to the periodical.
  • is_public (bool) – Is the tree public?
Raises:

ValueError – In case that name is not set, or sub_trees or sub_publications is not list/tuple.

path
indexes

Return list of property names, which may be used for indexing in DB.

Returns:List of strings.
Return type:list
collect_publications()

Recursively collect list of all publications referenced in this tree and all sub-trees.

Returns:List of UUID strings.
Return type:list
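
The recursive collection can be sketched with a small stand-in structure. FakeTree is hypothetical and carries only the fields needed here; the depth-first ordering of the result is an assumption, since the documentation only promises a list of UUID strings.

```python
from collections import namedtuple

# Hypothetical stand-in for Tree with just the fields used below.
FakeTree = namedtuple("FakeTree", ["name", "sub_trees", "sub_publications"])


def collect_publications(tree):
    """Return UUIDs from `tree` and all nested sub-trees, depth first."""
    uuids = list(tree.sub_publications)
    for sub_tree in tree.sub_trees:
        uuids.extend(collect_publications(sub_tree))
    return uuids


issue_1 = FakeTree("issue 1", [], ["uuid-a", "uuid-b"])
issue_2 = FakeTree("issue 2", [], ["uuid-c"])
year = FakeTree("2015", [issue_1, issue_2], [])
periodical = FakeTree("periodical", [year], ["uuid-root"])

all_uuids = collect_publications(periodical)
```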

Database:

DBPublication structure

Structure used in ZODB (database) for storing publications.

class storage.structures.db.db_publication.DBPublication(**kwargs)

Bases: persistent.Persistent, kwargs_obj.kwargs_obj.KwargsObj

Database structure used to store basic metadata about Publications.

title

str – Title of the publication.

author

str – Name of the author.

pub_year

str – Year when the publication was released.

isbn

str – ISBN for the publication.

urnnbn

str – URN:NBN for the publication.

uuid

str – UUID string to pair the publication with edeposit.

aleph_id

str – ID used in aleph.

producent_id

str – ID used for producent.

is_public

bool – Is the file public?

filename

str – Original filename.

is_periodical

bool – Is the publication periodical?

path

str – Path in the tree (used for periodicals).

file_pointer

str – Pointer to the file on the file server.

classmethod from_comm(pub)

Convert communication namedtuple to this class.

Parameters:pub (obj) – Publication instance which will be converted.
Returns:DBPublication instance.
Return type:obj
url
indexes

Returns:List of strings, which may be used as indexes in DB.
Return type:list

project_key
to_comm(light_request=False)

Convert self to Publication.

Returns:Publication instance.
Return type:obj
DBArchive structure

Structure used in ZODB (database) for storing ZIP archives / unpacked directories.

class storage.structures.db.db_archive.DBArchive(**kwargs)

Bases: persistent.Persistent, kwargs_obj.kwargs_obj.KwargsObj

Database structure used to store basic metadata about Archives.

isbn

str – ISBN for the archive.

uuid

str – UUID string to pair the archive with edeposit.

aleph_id

str – ID used in aleph.

dir_pointer

str – Pointer to the directory on the file server.

classmethod from_comm(pub)

Convert communication namedtuple to this class.

Parameters:pub (obj) – Archive instance which will be converted.
Returns:DBArchive instance.
Return type:obj
indexes

Returns:List of strings, which may be used as indexes in DB.
Return type:list

project_key
to_comm(light_request=False)

Convert self to Archive.

Returns:Archive instance.
Return type:obj

Generators:

structures_generator script

This script is used to generate Publication, DBPublication, Archive and DBArchive structures.

Installation

Installing this project is a little more complicated. Please read the installation notes:

Installation notes

The module itself can be installed using pip:

sudo pip install edeposit.amqp.storage

Configuration

After the installation, some configuration is required. Configuration is done using the settings.py script, which reads data from the configuration path ~/edeposit/storage.json.

Each uppercase attribute defined in settings can be reconfigured using the storage.json configuration file.

Required configuration options are:

Highly recommended options:

You should definitely change WEB_SERVER to paste. By default, the wsgiref backend is used, but that is only a single-threaded server. Paste allows multithreaded access of users to your server.

To change the default database paths, you also need to set ZCONF_PATH to the path with the ZEO configuration.

Example of the configuration

/etc/edeposit/storage.json:

{
    "PUBLIC_DIR": "/var/storage/public",
    "PRIVATE_DIR": "/var/storage/private",
    "ZCONF_PATH": "/var/storage/zconf",
    "PRIVATE_INDEX": true,
    "PRIVATE_INDEX_PASSWORD": "secret password",
    "WEB_SERVER": "paste"
}

Example of the ZEO configuration

/var/storage/zconf/zeo_client.conf:

<zeoclient>
  server localhost:8090
</zeoclient>

/var/storage/zconf/zeo.conf:

<zeo>
  address localhost:8090
</zeo>

<filestorage>
  path /var/storage/zodb/storage.fs
</filestorage>

<eventlog>
  level INFO
  <logfile>
    path /var/storage/zodb/zeo.log
    format %(asctime)s %(message)s
  </logfile>
</eventlog>

How to run the server

There are three scripts, which you have to start in order to get full functionality:

  • edeposit_storage_runzeo.sh (database)
  • edeposit_storage_server.py (webserver)
  • edeposit_amqp_storaged.py (amqp handler)

The webserver and the AMQP handler are optional, but the database script is mandatory.

Supervisord

To run the scripts, you can use supervisord:

[program:storagedaemon]
command = /usr/bin/edeposit_amqp_storaged.py start --foreground
process_name = storagedaemon
directory = /usr/bin
priority = 10
redirect_stderr = true
user = edeposit

[program:storageweb]
command = /usr/bin/edeposit_storage_server.py
process_name = storageweb
directory = /usr/bin
priority = 10
redirect_stderr = true
user = root

[program:storagezeo]
command = /usr/bin/edeposit_storage_runzeo.sh
process_name = storagezeo
directory = /usr/bin
priority = 10
redirect_stderr = true
user = edeposit

For storageweb, the user must be root only if you wish to run the webserver on port 80.

AMQP protocol

Here is the list of Request -> Response pairs describing responses to AMQP communication:

SaveRequest.Archive -> Archive
SaveRequest.Publication -> Publication
SaveRequest.Tree -> TreeInfo

SearchRequest -> SearchResult
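
The pairs above can be read as a small dispatch table. The sketch below only encodes that mapping; all classes are hypothetical namedtuple stand-ins with reduced fields, response_type is an illustrative helper, and raising ValueError for unknown input is an assumption rather than documented behaviour.

```python
from collections import namedtuple

# Hypothetical stand-ins for the communication structures.
Archive = namedtuple("Archive", ["isbn"])
Publication = namedtuple("Publication", ["isbn"])
Tree = namedtuple("Tree", ["name"])
SaveRequest = namedtuple("SaveRequest", ["record"])
SearchRequest = namedtuple("SearchRequest", ["query"])

# Record type carried by SaveRequest -> name of the response structure.
SAVE_RESPONSES = {
    Archive: "Archive",
    Publication: "Publication",
    Tree: "TreeInfo",
}


def response_type(request):
    """Return the name of the response structure expected for `request`."""
    if isinstance(request, SearchRequest):
        return "SearchResult"
    if isinstance(request, SaveRequest):
        for record_cls, name in SAVE_RESPONSES.items():
            if isinstance(request.record, record_cls):
                return name
    raise ValueError("Unknown request: %r" % (request,))
```

For example, response_type(SaveRequest(Tree("periodical"))) gives "TreeInfo", matching the third pair in the list.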

Source code

Project is released under the MIT license. Source code can be found at GitHub:

Unittests

Almost every feature of the project is tested by unittests. You can run those tests using provided run_tests.sh script, which can be found in the root of the project.

If you have any trouble, just add the --pdb switch at the end of your run_tests.sh command like this: ./run_tests.sh --pdb. This will drop you into the PDB shell.

Requirements

This script expects that the packages pytest, fake-factory and sh are installed. In case you don’t have them yet, they can be easily installed using the following command:

pip install --user pytest fake-factory sh

or for all users:

sudo pip install pytest fake-factory sh

Example

./run_tests.sh
============================= test session starts ==============================
platform linux2 -- Python 2.7.6, pytest-2.8.2, py-1.4.30, pluggy-0.3.1
rootdir: /home/bystrousak/Plocha/Dropbox/c0d3z/prace/edeposit.amqp.storage, inifile:
plugins: cov-1.8.1
collected 35 items

tests/test_amqp_chain.py .....
tests/test_publication_storage.py ........
tests/test_tree_handler.py ........
tests/structures/test_db_archive.py ...
tests/structures/test_db_publication.py ....
tests/structures/test_publication.py .
tests/structures/test_requests.py .
tests/structures/test_tree.py .....

========================== 35 passed in 11.02 seconds ==========================
