seccs — the SECure Content Store¶
seccs is a Python library that realizes a secure and efficient hash-table-like data structure for contents on top of any existing key-value store as provided by, e.g., cloud storage providers.
It has been developed as part of the work [LS17] at CISPA, Saarland University.
Contents:
Installation¶
Run:
$ pip install seccs
If you want to use AES-SIV encryption (you probably want!), you also need to install PyCrypto 2.7a1 which is not yet available in PyPI:
$ pip install https://ftp.dlitz.net/pub/dlitz/crypto/pycrypto/pycrypto-2.7a1.tar.gz
Usage and Overview¶
seccs is a Python implementation of sec-cs, a secure and efficient
hash-table-like data structure for contents. It stores its data on top of any
existing database providing a key-value store interface. Thus, it is likewise
usable with in-memory dict
objects, persistent databases like
ZODB
, and many cloud storage providers.
Its details are described in [LS17]. In short, it is suitable for usage on untrusted cloud storage and has the following desirable properties:
- Confidentiality:
- Stored contents are securely encrypted using a symmetric key.
- Authenticity:
- sec-cs guarantees authenticity of all stored contents, irrespective of gurantees of the underlying database.
- Storage Efficiency:
- Data deduplication strategies are applied to all stored contents. When storing new contents, overlapping parts of existing contents are automatically reused as to avoid redundancy. sec-cs is optimized for efficiency in presence of many similar contents: Storage costs of an n-bytes content that differs only slightly from an existing content are in O(log n).
Typical Use Case¶
In the most-typical configuration, sec-cs chunks its contents hierarchically using ML-CDC (see [LS17]), usually relying on Rabin Karp hashes, and stores the resulting nodes in a database after applying AES-SIV-256 for encryption and authentication. From a user perspective, we have to initialize a suitable database object and a 32-bytes key first.
- Database and key setup:
>>> database = dict() >>> import os >>> key = os.urandom(32)
Note that we might want to store the database and the key at some persistent location in practice.
Next, we need to create a crypto wrapper which is in charge of all the
cryptographic operations. Depending on our security goals (e.g., whether
encryption is required), we could choose any suitable wrapper from
seccs.crypto_wrapper
. Afterwards, we can instantiate the data structure.
- Choice of crypto wrapper and instantiation of data structure:
>>> import seccs >>> crypto_wrapper = seccs.crypto_wrapper.AES_SIV_256(key) # install PyCrypto>=2.7a1 to use AES-SIV >>> seccs = seccs.SecCSLite(256, database, crypto_wrapper) # 256 is the chunk size
Note
Internally, sec-cs splits contents into chunks, creates a tree of chunks for each of them and inserts each node separately into the database. The first parameter specifies the desired average size of nodes inserted into the database. As deduplication is performed at the chunk level, large chunk sizes decrease deduplication performance, but they also create less storage overhead when storing non-deduplicable contents as fewer nodes have to be stored.
Performance is discussed in detail in [LS17]. If high redundancy is expected, 256 bytes is typically a good compromise; otherwise, larger chunk sizes might be more suitable.
- We can now insert contents...
>>> content = "This is a test content." >>> digest = seccs.put_content(content) >>> repr(digest) '\x08,f+\xa74\xdc\x0f\xe5Oo\xcb;\x83\xb9T\x00\x00\x00\x00\x00\x00\x00\x17'
- ...retrieve them...
>>> seccs.get_content(digest) This is a test content.
- ...and delete them as soon as they are not needed anymore:
>>> seccs.delete_content(digest)
Storage Efficiency¶
seccs avoids redundancy in the database wherever possible, as gets clear in the following example.
- Consider this function for measuring the database‘s current storage costs in bytes:
>>> import sys >>> def dbsize(db): >>> return sum([sys.getsizeof(k) + sys.getsizeof(v) for (k, v) in db.items()])
- Initially, the database is empty:
>>> dbsize(database) 0
- Insertion of a 1 MiB content clearly causes some storage costs:
>>> content1 = os.urandom(1024*1024) >>> digest1 = seccs.put_content(content1) >>> dbsize(database) 1583030
- But inserting the same content for a second time does not incur additional costs:
>>> content2 = content1 >>> digest2 = seccs.put_content(content2) >>> digest1 == digest2 # identical contents yield identical digests True >>> dbsize(database) 1583030
Clearly, the database grows if different contents are inserted. However, these costs are low if inserted contents are similar to existing ones.
- Only about 2.3 KiB are required to store another 1 MiB content with one byte changed:
>>> content3 = ''.join([content1[:512*1024], 'x', content1[512*1024+1:]]) >>> digest3 = seccs.put_content(content3) >>> dbsize(database) 1585395
- Costs are similar even if the identical parts are shifted...
>>> content4 = ''.join([content1[:512*1024], 'xyz', content1[512*1024+1:]]) >>> digest4 = seccs.put_content(content4) >>> dbsize(database) 1588010
- ...and deduplication is also performed if a content consists of parts of different existing contents:
>>> content5 = ''.join([content1, content3, content4]) >>> digest5 = seccs.put_content(content5) >>> dbsize(database) 1591009
In the last example, the growth was about 3 KiB.
- Furthermore, storage space is reclaimed completely when contents are removed:
>>> seccs.delete_content(digest5) >>> seccs.delete_content(digest4) >>> seccs.delete_content(digest3) >>> seccs.delete_content(digest2) >>> dbsize(database) 1583030 >>> seccs.delete_content(digest1) >>> dbsize(database) 0
Note
Every seccs.delete_content()
call undos eactly one
seccs.put_content()
call. Thus, even if the same content has been
inserted twice, yielding only a single digest, it has to be deleted twice as
well to get actually removed.
- References:
[LS17] (1, 2, 3) Dominik Leibenger and Christoph Sorge (2017). sec-cs: Getting the Most out of Untrusted Cloud Storage. In Proceedings of the 42nd IEEE Conference on Local Computer Networks (LCN 2017), 2017. (Preprint: arXiv:1606.03368)
seccs package¶
Module contents¶
sec-cs — the SECure Content Store.
This module provides an implementation of the secure content store data structure introduced in [LS16].
sec-cs allows secure and efficient storage of contents in an existing key-value database, providing the following features:
- Confidentiality:
- Stored contents are securely encrypted using a symmetric key.
- Authenticity:
- sec-cs guarantees authenticity of all stored contents, irrespective of gurantees of the underlying database.
- Storage Efficiency:
- Data deduplication strategies are applied to all stored contents. When storing new contents, overlapping parts of existing contents are automatically reused as to avoid redundancy. Storage costs of an n-bytes content that differs only slightly from an existing content are in O(log n).
Note
The only sec-cs implementation currently included in this module is called
SecCSLite
. While it is likely suitable for many projects and can be
used as is, it is actually intended as a base class for a much more powerful
variant, SecCS
, which makes some slight changes to the internal
storage structure and will be published in the near future.
References
[LS16] | (1, 2) Dominik Leibenger and Christoph Sorge (2016). sec-cs: Getting the Most out of Untrusted Cloud Storage. arXiv:1606.03368 |
-
class
seccs.
SecCSLite
(chunk_size, database, crypto_wrapper, chunking_strategy=None, reference_counter=None, **kwargs)[source]¶ Bases:
object
Secure Content Store lite.
Basic implementation of the Secure Content Store data structure, supports only insertion (put), retrieval (get) and deletion (delete) of contents.
Parameters: - chunk_size (int) – Target chunk size, i.e., expected size of all chunks stored in the database.
- database – Persistent database used as backend. Can be any object with a dict-like interface, i.e., any object implementing the operations __getitem__, __setitem__, __delitem__ and __contains__.
- crypto_wrapper (
crypto_wrapper.BaseCryptoWrapper
) – Crypto wrapper object that specifies cryptographic operations to be applied to data stored in the database. - chunking_strategy (Optional[
fastchunking.BaseChunkingStrategy
]) – Chunking strategy that shall be applied to contents. Defaults to Rabin-Karp-based content defined chunking with 48-bytes window size. - reference_counter (Optional[
seccs.rc.BaseReferenceCounter
]) – Reference counting strategy. By default, reference counters are stored in database under keys key || “r”, where key is the key whose references are counted. - **kwargs –
Extra keyword arguments that you should NOT use unless you really, really know what you are doing, e.g.:
- length_to_height_fn: Function that resolves content lengths to appropriate chunk tree heights. May be used to modify the multi-level chunking approach performed by default, e.g., to degrade it to single-level chunking or similar.
- height_to_chunk_size_fn: Function that computes the target chunk size for a specific level (height) of a chunk tree. May be used to create imbalanced chunk trees.
Raises: seccs.UnsupportedChunkSizeError
– If the chosen chunk size would create superchunk nodes with less than two expected children as efficiency guarantees would fail in this case (see [LS16]).-
delete_content
(k, ignore_rc=False)[source]¶ Delete a content from the data structure.
Decreases the content’s reference counter and deletes its root chunk (possibly including children) if no references are left.
Parameters: - k (str) – The digest under which the content is stored.
- ignore_rc (Optional[bool]) – If True, decrease of reference counter of the root node is skipped and root node (possibly including children) is deleted straight away.
-
get_content
(k)[source]¶ Retrieve a content from the data structure.
Parameters: k (str) – The digest under which the content is stored. Returns: The content bytestring. Return type: str
-
put_content
(m, ignore_rc=False)[source]¶ Insert a content into the data structure.
Parameters: - m (str) – The message or content that shall be processed and inserted into the data structure.
- ignore_rc (Optional[bool]) – If True, increase of reference counter for the root node of the generated chunk tree is skipped. Defaults to False.
Returns: Digest of the content that allows its retrieval using
get_content()
.Return type: str
-
put_content_and_check_if_new
(m, ignore_rc=False)[source]¶ Insert a content into the data structure.
Like
put_content()
, but return value includes information whether the content had been in the data structure before.Parameters: - m (str) – The message or content that shall be processed and inserted into the data structure.
- ignore_rc (Optional[bool]) – If True, increase of reference counter for the root node of the generated chunk tree is skipped. Defaults to False.
Returns: (digest, is_new), where digest is the content’s digest that allows its retrieval using
get_content()
, and is_new is True if the content has been inserted for the first time and False if it had existed before.Return type: tuple
Submodules¶
seccs.crypto_wrapper module¶
Crypto wrapper implementations.
A crypto wrapper encapsulates all cryptographic operations that have to be applied to node representations before they can be safely inserted into the (untrusted) database, e.g., encryption and authentication. It defines the keys (digests) under which (wrapped) node representations are stored and specifies how node representations can be extracted from (digest, value) pairs stored in the database.
This module includes several crypto wrapper implementations with different security properties.
-
class
seccs.crypto_wrapper.
AES_SIV_256
(key)[source]¶ Bases:
seccs.crypto_wrapper.BaseCryptoWrapper
AES-SIV-256 crypto wrapper.
Provides confidentiality and authenticity for chunk tree nodes based on a symmetric 32-bytes key specified during instantiation.
Root nodes are handled identically to inner nodes at the same level.
- Nodes are represented as follows:
- value: AES-SIV-256(<key>, <value>, additional_data=<height>)
- digest: <digest produced by AES-SIV-256>
Parameters: key (str) – Cryptographic key used for symmetric encryption and authentication. Note
Requires PyCrypto >= 2.7a1.
-
unwrap_value
(value, digest, height, is_root, length=-1)[source]¶ Decrypts and verifies node representation and returns the result on success.
Raises: AuthenticityError
– If digest does not match.
-
wrap_value
(value, height, is_root)[source]¶ Encrypts node representation using deterministic authenticated encryption, i.e., with AES in SIV mode, including node height as additional data that is authenticated, resulting in a digest (MAC) that is used as digest and a ciphertext that is used as value.
Note
As AES-SIV cannot encrypt empty contents, a distinguished zero digest is artifically assigned to empty node representations instead.
-
class
seccs.crypto_wrapper.
AES_SIV_256_DISTINGUISHED_ROOT
(key)[source]¶ Bases:
seccs.crypto_wrapper.AES_SIV_256
AES-SIV-256 crypto wrapper with distinguished root representation.
Provides confidentiality and authenticity for chunk tree nodes based on a symmetric 32-bytes key specified during instantiation.
Root nodes are handled differently from inner nodes at the same level.
- Nodes are represented as follows:
- value: AES-SIV-256(<key>, <value>, additional_data=<height>||<is_root>)
- digest: <digest produced by AES-SIV-256>
Parameters: key (str) – Cryptographic key used for symmetric encryption and authentication. Note
Requires PyCrypto >= 2.7a1.
-
unwrap_value
(value, digest, height, is_root, length=-1)[source]¶ Decrypts and verifies node representation and returns the result on success.
Raises: AuthenticityError
– If digest does not match.
-
wrap_value
(value, height, is_root)[source]¶ Encrypts node representation using deterministic authenticated encryption, i.e., with AES in SIV mode, including node height and is_root flag as additional data that is authenticated, resulting in a digest (MAC) that is used as digest and a ciphertext that is used as value.
-
exception
seccs.crypto_wrapper.
AuthenticityError
[source]¶ Bases:
seccs.crypto_wrapper.IntegrityError
Raised if an authenticity verification fails.
-
class
seccs.crypto_wrapper.
BaseCryptoWrapper
[source]¶ Bases:
object
Abstract class specifying the crypto wrapper interface.
-
unwrap_value
(value, digest, height, is_root, length=-1)[source]¶ Converts the representation of a node as stored in the database (e.g., an encrypted representation) into its normal, e.g., decrypted, representation.
Parameters: - value (str) – Wrapped node representation.
- digest (str) – Key under which the node is stored in the database.
- height (int) – Height of the node in its chunk tree.
- is_root (bool) – Whether or not the node is the chunk tree’s root.
- [Optional (length) – Length of the content represented by this node. Defaults to -1.
Returns: Node representation.
Return type: str
-
wrap_value
(value, height, is_root)[source]¶ Converts a node representation into a (digest, value) pair that can be safely inserted into the database.
Parameters: - value (str) – The node representation.
- height (int) – The height of the node in its chunk tree.
- is_root (bool) – Whether or not the node is the chunk tree’s root.
Returns: (wrapped_value, digest).
Return type: tuple
-
-
class
seccs.crypto_wrapper.
HMAC_SHA_256
(key)[source]¶ Bases:
seccs.crypto_wrapper.BaseCryptoWrapper
HMAC-SHA-256 crypto wrapper.
Provides authenticity for chunk tree nodes based on a symmetric 32-bytes key specified during instantiation.
Root nodes are handled identically to inner nodes at the same level.
- Nodes are represented as follows:
- value: <value>
- digest: HMAC-SHA-256(<key>, <height> || <value>)
Parameters: key (str) – Cryptographic key used for symmetric authentication. -
unwrap_value
(value, digest, height, is_root, length=-1)[source]¶ Verifies SHA-256-based HMAC and returns node representation on success.
Raises: AuthenticityError
– If digest does not match.
-
class
seccs.crypto_wrapper.
HMAC_SHA_256_DISTINGUISHED_ROOT
(key)[source]¶ Bases:
seccs.crypto_wrapper.HMAC_SHA_256
HMAC-SHA-256 crypto wrapper with distinguished root representation.
Provides authenticity for chunk tree nodes based on a symmetric 32-bytes key specified during instantiation.
Root nodes are handled differently from inner nodes at the same level.
- Nodes are represented as follows:
- value: <value>
- digest: HMAC-SHA-256(<key>, <height> || <is_root> || <value>)
Parameters: key (str) – Cryptographic key used for symmetric authentication. -
unwrap_value
(value, digest, height, is_root, length=-1)[source]¶ Verifies SHA-256-based HMAC and returns node representation on success.
Raises: AuthenticityError
– If digest does not match.
-
exception
seccs.crypto_wrapper.
IntegrityError
[source]¶ Bases:
exceptions.ValueError
Raised if an integrity verification fails.
-
class
seccs.crypto_wrapper.
SHA_256
[source]¶ Bases:
seccs.crypto_wrapper.BaseCryptoWrapper
SHA-256 crypto wrapper.
Provides integrity for chunk tree nodes.
- Nodes are represented as follows:
- value: <value>
- digest: SHA-256(<value>)
-
unwrap_value
(value, digest, height, is_root, length=-1)[source]¶ Verifies SHA-256 hash and returns node representation on success.
Raises: IntegrityError
– If digest does not match.
seccs.rc module¶
Simple reference counter implementations.
-
class
seccs.rc.
BaseReferenceCounter
[source]¶ Bases:
object
Abstract base class for reference counters.
-
dec
(key)[source]¶ Abstract decrement interface.
Parameters: key – Key whose reference counter shall be decremented. Returns: Number of references of key after decrement.
-
-
class
seccs.rc.
DatabaseReferenceCounter
(database)[source]¶ Bases:
seccs.rc.BaseReferenceCounter
Database-backed reference counter.
Uses a given database to store reference counters. The reference counter of an element key is stored as follows:
- If its value is 0, key is not stored in the database.
- If its value is > 0, its int value is stored in the database under key.
Parameters: database – Database object with a dict-like interface, i.e., implementing the operations __getitem__, __setitem__ and __delitem__.
-
class
seccs.rc.
KeySuffixDatabaseReferenceCounter
(database, suffix)[source]¶ Bases:
seccs.rc.DatabaseReferenceCounter
Database-backed reference counter.
Similar to
DatabaseReferenceCounter
, but the reference counter of a key is not stored directly under key, but under key || suffix.Parameters: - database – Database object with a dict-like interface, i.e., implementing the operations __getitem__, __setitem__ and __delitem__.
- suffix – Suffix for keys.
-
class
seccs.rc.
NoReferenceCounter
[source]¶ Bases:
seccs.rc.BaseReferenceCounter
Non-counting reference counter, always returns 1 for any key.
Can be used to disable reference counting where a reference counter is required.