Documentation contents

Welcome to mmappickle’s documentation!

This Python 3 module makes it possible to store large structures in a Python pickle in such a way that arrays can be memory-mapped instead of being copied into memory. This module is licensed under the LGPL3 license.

Currently, the container is a dictionary (mmappickle.mmapdict), whose keys are Unicode strings of less than 256 bytes.

It supports any type of value, but it is only possible to memory map numpy.ndarray and numpy.ma.MaskedArray at present.

It also supports concurrent access (i.e. you can pass a mmappickle.mmapdict as an argument to a function which is called using the multiprocessing Python module).

Why use mmappickle?

Mmappickle is designed for “unstructured” parallel access, with a strong emphasis on adding new data.

Here are two example use cases:

  • Timelapse hyperspectral scanning

    At each time interval, a hyperspectral image (which is a numpy array of a few hundred megabytes) is added to the file, along with the related metadata.

    The key features for which mmappickle is useful are:

    • it is possible to have an independent viewer process, which monitors the file as it is appended to, and allows the operator to stop the capture once sufficient data has been acquired.
    • all the information is kept in one file, which makes it easier to archive and distribute.
    • the images are not stored in RAM, which would be a problem due to their size.

    Moreover, as the file is a normal pickle, it is not required to install anything more than a standard distribution of Python to have it work. This is useful when distributing files to occasional users, who may be less familiar with Python, and may not be able to use pip to install libraries.

  • Image registration

    When there are multiple (hyperspectral) images of the same subject, a common requirement is to register these images to allow creating a combined image containing the information of all of them. Depending on the images, this could be for example HDR imaging, or simply stitching the images together.

    Commonly, a set of files is provided to the algorithm, which computes keypoints, then the transformation parameters, and finally creates an output file. For practical reasons, intermediate data is rarely kept, as it means more files, and the references to the original files must be strictly maintained.

    With mmappickle, all the input images are in one file. The registration algorithm can simply add the keypoints and the transformation parameters to the original file. This simplifies debugging and also serves as a cache when using successively different combining algorithms.

Quick start

mmappickle.mmapdict behaves like a dictionary. For example:

>>> from mmappickle import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle') #could be an existing file
>>> m['a_sample_key'] = 'value'
>>> m['other_key'] = [1,2,3]
>>> print(m['a_sample_key'])
value
>>> m['other_key'][2]
3
>>> del m['a_sample_key']
>>> print(m.keys())
['other_key']

The contents of the dictionary are stored to disk. For example, in another python interpreter:

>>> from mmappickle import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle')
>>> print(m['other_key'])
[1, 2, 3]

It is also possible to open the file in read-only mode, in which case any modification will fail:

>>> from mmappickle import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle', True)
>>> m['other_key'] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/laurent/git/mmappickle/mmappickle/utils.py", line 22, in require_writable_wrapper
    raise io.UnsupportedOperation('not writable')
io.UnsupportedOperation: not writable

Of course, the main interest is to store numpy arrays:

>>> import numpy as np
>>> from mmappickle.dict import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle')
>>> m['test'] = np.array([1,2,3], dtype=np.uint8)
>>> m['test'][1] = 4
>>> print(m['test'])
[1 4 3]
>>> print(type(m['test']))
<class 'numpy.core.memmap.memmap'>

As you can see, the m['test'] is now memory-mapped. This means that its content is not loaded in memory, but instead accessed directly from the file.
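To illustrate what "accessed directly from the file" means, here is a minimal sketch that uses only the standard library mmap module rather than mmappickle or numpy. The temporary file and variable names are chosen for illustration only; the point is that a write through the mapping ends up in the file without an explicit write() call.

```python
import mmap
import os
import tempfile

# Create a small file holding three bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(bytes([1, 2, 3]))

# Map the file into memory and change one byte through the mapping.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as view:
        view[1] = 4
        view.flush()

# The change is visible when the file is read back normally.
with open(path, "rb") as f:
    content = f.read()

os.remove(path)
print(list(content))  # [1, 4, 3]
```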

Unfortunately, the array has to exist in order to serialize it to the mmapdict. If the array exceeds the available memory, this won’t work. Instead one should use stubs:

>>> from mmappickle.stubs import EmptyNDArray
>>> m['test_large'] = EmptyNDArray((300,300,300))
>>> print(type(m['test_large']))
<class 'numpy.core.memmap.memmap'>

The matrix in m['test_large'] occupies 216 MB (300 × 300 × 300 elements of 8 bytes each), but at no point was it allocated in RAM. This way, it is possible to allocate arrays larger than the size of the memory. One could have written m['test_large'] = np.empty((300,300,300)), but unfortunately the memory is allocated when calling numpy.empty().
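The size quoted above can be checked with a short calculation. This sketch assumes 8-byte elements (64-bit floats, which is what numpy.empty() produces by default); the variable names are illustrative.

```python
import math

shape = (300, 300, 300)
itemsize = 8  # assumption: 64-bit floats, 8 bytes per element

nbytes = math.prod(shape) * itemsize
print(nbytes)          # 216000000
print(nbytes / 10**6)  # 216.0 (megabytes)
```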

Finally, one last useful trick is the mmappickle.mmapdict.vacuum() method. It allows reclaiming the disk space:

>>> del m['test_large']
>>> #Here, /tmp/test.mmdpickle still occupies ~216M of hard disk
>>> m.vacuum()
>>> #Now the disk space has been reclaimed.

Warning

When running mmappickle.mmapdict.vacuum(), it is crucial that there are no other references to the file content, either in this process or in others. In particular, no memory-mapped array may exist. If this rule is not followed, the process may crash, the file may be corrupted, or invalid data may be returned.

Getting help

Please use the mmappickle issue tracker on GitHub to ask any question.

To report bugs, please see Reporting bugs.

Installation

mmappickle can be installed either using pip, or from source.

Since mmappickle requires Python 3, you should ensure that you’re using the correct pip executable. On some distributions, the executable is named pip3.

It is possible to check the Python version of pip using:

pip -V

Similarly, you may need to use py.test-3 instead of py.test.

Installing using pip

To install using pip, simply run:

pip install mmappickle

Manual installation from source

To install manually, run the following:

git clone https://github.com/UniNE-CHYN/mmappickle
cd mmappickle
pip install .

To contribute, use pip install -e instead of pip install. This sets a link to the source folder in the python installation, instead of copying the files.

It is advisable to run the tests to ensure that everything is working. This can be done by running the following command in the mmappickle directory:

py.test

API Documentation

This is the documentation of the public API of mmappickle. It covers the most commonly used functions.

Developers interested in extending mmappickle, or in understanding the inner workings should read the following page: Internals.

Main module

class mmappickle.mmapdict(file, readonly=None, picklers=None)[source]

Class to access a mmap-able dictionary in a file.

This class is safe to use in a multi-process environment.

__init__(file, readonly=None, picklers=None)[source]

Create or load a mmap dictionary.

Parameters:
  • file – either a file-like object or a string representing the name of the file.
  • readonly – if file is a string, the file will be opened in read-only mode if set to True.
  • picklers – explicit list of picklers. Usually this is not needed (by default, all are used)
writable

True if the file is writable, False otherwise

commit_number

The monotonically increasing commit number of the mmapdict.

This is useful to know if the keys have been changed by another process. If the commit_number hasn’t changed, it is guaranteed that keys() has not changed either.

Although it is possible to set the commit number using this property, there is generally no use for this in external code.
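As a rough sketch of how commit_number can drive a cache, the helper below reloads the key list only when the commit number has changed. KeyCache and FakeStore are hypothetical names introduced for this illustration; FakeStore is a stand-in that mimics the commit_number and keys() members described here, so that the example runs without mmappickle installed.

```python
class KeyCache:
    """Cache the keys of a dict-like object that exposes commit_number."""

    def __init__(self, store):
        self._store = store
        self._commit = None
        self._keys = None
        self.reloads = 0  # counts how often the keys were actually re-read

    def keys(self):
        current = self._store.commit_number
        if current != self._commit:
            # The commit number moved: the key set may have changed.
            self._keys = sorted(self._store.keys())
            self._commit = current
            self.reloads += 1
        return self._keys


class FakeStore(dict):
    """Hypothetical stand-in for a mmapdict (not part of mmappickle)."""

    commit_number = 0

    def __setitem__(self, k, v):
        super().__setitem__(k, v)
        self.commit_number += 1


store = FakeStore()
cache = KeyCache(store)
store['a'] = 1
cache.keys()
cache.keys()          # served from the cache, no reload
store['b'] = 2
print(cache.keys())   # ['a', 'b']
print(cache.reloads)  # 2
```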

__contains__(k)[source]

Check if a key exists in the dictionary

Parameters:k – Key (string) to check for existence
Returns:True if key exists in the dictionary, False otherwise.
__weakref__

list of weak references to the object (if defined)

keys()[source]
Returns:a set-like object providing a view on D’s keys
__setitem__(k, v)[source]

Create or change key k, sets its value to v.

Parameters:
  • k – key, should be a Unicode string of binary length <= 255.
  • v – value, any picklable object

When replacing a value, this function adds the new key-value pair at the end of the file, and marks the old one as invalid, but leaves the data in place. As a consequence, this function can be used when using the file concurrently from multiple processes. However, other processes may still be using the old value if they don’t reload the value from the file.

If no concurrent access exists to the file, the old value can be freed using vacuum().

__getitem__(k)[source]

Get value for key k, raise KeyError if the key doesn’t exist in the file.

If possible, the data will be returned as a mmap’ed object.

__delitem__(k)[source]

Mark key k as not valid in the file.

Parameters:k – key to remove

This method marks the key as invalid, but leaves the data in place. As a consequence, this function can be used when using the file concurrently from multiple processes. However, other processes may still be using the value if they don’t reload the keys from the file.

If no concurrent access exists to the file, the old value can be freed using vacuum().

vacuum(chunk_size=1048576)[source]

Free all deleted keys, effectively reclaiming disk space.

Only use this function when no mmap exists on the file. Usually it is safer to run it only in a part of the code where there is no concurrent access.

Parameters:chunk_size – The size of the buffer used to shift data in the file.

Warning

No mmap should exist on this file (both in this python script, and in others), as the data will be shifted.

If an mmap exists, it could crash the process and/or corrupt the file and/or return invalid data.

fsck()[source]

Attempt to fix the file, if possible.

This function should be called if some data could not be written to a file. This might be the case if, for example, not enough disk space was available.

This method truncates the file and recreates a valid terminator.

Warning

Calling this function may lead to data loss.

Stubs

Internals

mmapdict files are pickle files, containing a dictionary, but with a special format. The main idea is to have a file of predictable structure, to be able to compute the offsets for the memory maps. Moreover, a way to disable a specific key is required, either to replace it or to delete it without changing the offsets of the file.

For example, for the following dictionary:

{'key': 'value', 'test': array([1, 2, 3], dtype=uint8)}

The normal pickle module would output:

    0: \x80 PROTO      4
    2: \x95 FRAME      172
   11: }    EMPTY_DICT
   12: \x94 MEMOIZE
   13: (    MARK
   14: \x8c     SHORT_BINUNICODE 'test'
   20: \x94     MEMOIZE
   21: \x8c     SHORT_BINUNICODE 'numpy.core.multiarray'
   44: \x94     MEMOIZE
   45: \x8c     SHORT_BINUNICODE '_reconstruct'
   59: \x94     MEMOIZE
   60: \x93     STACK_GLOBAL
   61: \x94     MEMOIZE
   62: \x8c     SHORT_BINUNICODE 'numpy'
   69: \x94     MEMOIZE
   70: \x8c     SHORT_BINUNICODE 'ndarray'
   79: \x94     MEMOIZE
   80: \x93     STACK_GLOBAL
   81: \x94     MEMOIZE
   82: K        BININT1    0
   84: \x85     TUPLE1
   85: \x94     MEMOIZE
   86: C        SHORT_BINBYTES b'b'
   89: \x94     MEMOIZE
   90: \x87     TUPLE3
   91: \x94     MEMOIZE
   92: R        REDUCE
   93: \x94     MEMOIZE
   94: (        MARK
   95: K            BININT1    1
   97: K            BININT1    3
   99: \x85         TUPLE1
  100: \x94         MEMOIZE
  101: \x8c         SHORT_BINUNICODE 'numpy'
  108: \x94         MEMOIZE
  109: \x8c         SHORT_BINUNICODE 'dtype'
  116: \x94         MEMOIZE
  117: \x93         STACK_GLOBAL
  118: \x94         MEMOIZE
  119: \x8c         SHORT_BINUNICODE 'u1'
  123: \x94         MEMOIZE
  124: K            BININT1    0
  126: K            BININT1    1
  128: \x87         TUPLE3
  129: \x94         MEMOIZE
  130: R            REDUCE
  131: \x94         MEMOIZE
  132: (            MARK
  133: K                BININT1    3
  135: \x8c             SHORT_BINUNICODE '|'
  138: \x94             MEMOIZE
  139: N                NONE
  140: N                NONE
  141: N                NONE
  142: J                BININT     -1
  147: J                BININT     -1
  152: K                BININT1    0
  154: t                TUPLE      (MARK at 132)
  155: \x94         MEMOIZE
  156: b            BUILD
  157: \x89         NEWFALSE
  158: C            SHORT_BINBYTES b'\x01\x02\x03'
  163: \x94         MEMOIZE
  164: t            TUPLE      (MARK at 94)
  165: \x94     MEMOIZE
  166: b        BUILD
  167: \x8c     SHORT_BINUNICODE 'key'
  172: \x94     MEMOIZE
  173: \x8c     SHORT_BINUNICODE 'value'
  180: \x94     MEMOIZE
  181: u        SETITEMS   (MARK at 13)
  182: .    STOP
highest protocol among opcodes = 4

This works fine, but doesn’t allow random access.
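A listing like the one above can be produced with the standard library pickletools module. The sketch below uses a smaller dictionary without the numpy array, so that it runs on a plain Python installation; the exact opcodes and offsets therefore differ from the listing above.

```python
import io
import pickle
import pickletools

payload = pickle.dumps({'key': 'value'}, protocol=4)

# pickletools.dis() writes a symbolic disassembly to the given stream.
out = io.StringIO()
pickletools.dis(payload, out=out)
listing = out.getvalue()
print(listing)
```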

Let’s look at what a mmapdict file looks like, for the same data:

    0: \x80 PROTO      4
    2: \x95 FRAME      13
   11: J    BININT     1
   16: 0    POP
   17: J    BININT     2
   22: 0    POP
   23: (    MARK
   24: \x95     FRAME      20
   33: \x8c     SHORT_BINUNICODE 'key'
   38: \x8c     SHORT_BINUNICODE 'value'
   45: J        BININT     1
   50: 0        POP
   51: \x88     NEWTRUE
   52: 0        POP
   53: \x95     FRAME      110
   62: \x8c     SHORT_BINUNICODE 'test'
   68: \x8c     SHORT_BINUNICODE 'numpy.core.fromnumeric'
   92: \x8c     SHORT_BINUNICODE 'reshape'
  101: \x93     STACK_GLOBAL
  102: \x8c     SHORT_BINUNICODE 'numpy.core.multiarray'
  125: \x8c     SHORT_BINUNICODE 'fromstring'
  137: \x93     STACK_GLOBAL
  138: \x8e     BINBYTES8  b'\x01\x02\x03'
  150: \x8c     SHORT_BINUNICODE 'uint8'
  157: \x86     TUPLE2
  158: R        REDUCE
  159: K        BININT1    3
  161: \x85     TUPLE1
  162: \x86     TUPLE2
  163: R        REDUCE
  164: J        BININT     0
  169: 0        POP
  170: \x88     NEWTRUE
  171: 0        POP
  172: \x95     FRAME      2
  181: d        DICT       (MARK at 23)
  182: .    STOP
highest protocol among opcodes = 4

We can note the following changes:

  • There are hidden values at the beginning (version = 1, file revision = 2)
  • Each key-value pair is in an individual frame, which also contains a hidden int (memo max index) and, finally, a hidden TRUE.
  • The numpy array is created using numpy.core.fromnumeric.reshape(numpy.core.multiarray.fromstring(data, dtype), shape) instead of the “traditional” way

The version field is used to allow further developments, and is fixed to 1 at present. The file revision is increased each time a key of the dictionary is changed, to allow caching when there is concurrent access. Memo max index is used because MEMOIZE/GET/PUT opcodes may have to be renumbered when pickling values. This is a cache to avoid having to parse the whole file.
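Because the header has a fixed layout, the version and file revision sit at fixed offsets. The following sketch reads them from the first 24 bytes, using the offsets from the disassembly above (BININT at 11 and 17, MARK at 23). read_header is a hypothetical helper written for this illustration, not a function of mmappickle, and the sample bytes are built by hand.

```python
import struct


def read_header(blob):
    """Return (version, revision) from the start of a mmapdict-style pickle."""
    if blob[0:2] != b'\x80\x04':
        raise ValueError('not a protocol 4 pickle')
    if blob[2:3] != b'\x95':
        raise ValueError('missing FRAME opcode')
    # Bytes 3..10 hold the 8-byte frame length, which is not needed here.
    if blob[11:12] != b'J' or blob[16:17] != b'0':
        raise ValueError('missing version field (BININT, POP)')
    if blob[17:18] != b'J' or blob[22:23] != b'0':
        raise ValueError('missing revision field (BININT, POP)')
    if blob[23:24] != b'(':
        raise ValueError('missing MARK')
    # BININT stores a 4-byte little-endian signed integer.
    version, = struct.unpack('<i', blob[12:16])
    revision, = struct.unpack('<i', blob[18:22])
    return version, revision


# Hand-built header matching the listing: version 1, file revision 2.
sample = (b'\x80\x04' + b'\x95' + struct.pack('<Q', 13)
          + b'J' + struct.pack('<i', 1) + b'0'
          + b'J' + struct.pack('<i', 2) + b'0' + b'(')
print(read_header(sample))  # (1, 2)
```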

Finally, the hidden TRUE is a “hack” to allow removing a key. In fact, it is not possible to move data when it’s memmap’ed. To avoid this, the first TRUE is replaced by a POP when deleting the key. In summary, the stack is working in the following way:

  • Key exists: KEY, VALUE, memo max index, POP, TRUE, POP. (reduced as KEY, VALUE)
  • Key doesn’t exist: KEY, VALUE, memo max index, POP, POP, POP. (disappears when reduced)
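The mechanism can be reproduced with the standard library alone. The sketch below builds a small file image by hand, following the opcode sequence shown in the listing above, then deletes a key by overwriting a single byte. The helper names (frame, binint, text, entry) are made up for this illustration, and the values are plain strings rather than numpy arrays.

```python
import pickle
import struct

POP, NEWTRUE, MARK, DICT, STOP = b'0', b'\x88', b'(', b'd', b'.'


def frame(body):
    # FRAME opcode followed by the 8-byte little-endian body length.
    return b'\x95' + struct.pack('<Q', len(body)) + body


def binint(n):
    # BININT opcode followed by a 4-byte little-endian signed integer.
    return b'J' + struct.pack('<i', n)


def text(s):
    # SHORT_BINUNICODE opcode, 1-byte length, UTF-8 bytes (at most 255).
    raw = s.encode('utf-8')
    return b'\x8c' + bytes([len(raw)]) + raw


def entry(key, value, memo_max=0):
    # KEY, VALUE, hidden memo max index, POP, NEWTRUE, POP.
    return frame(text(key) + text(value)
                 + binint(memo_max) + POP + NEWTRUE + POP)


HEADER = b'\x80\x04' + frame(binint(1) + POP + binint(2) + POP + MARK)
TERMINATOR = frame(DICT + STOP)

first = entry('key', 'value')
second = entry('other', 'data')
blob = bytearray(HEADER + first + second + TERMINATOR)
before = pickle.loads(bytes(blob))

# Delete 'other': its NEWTRUE is the second to last byte of the entry.
# Replacing it with POP makes the entry pop its own key and value.
valid_offset = len(HEADER) + len(first) + len(second) - 2
blob[valid_offset] = POP[0]
after = pickle.loads(bytes(blob))

print(before)  # {'key': 'value', 'other': 'data'}
print(after)   # {'key': 'value'}
```

Note that the length of the data does not change when the key is deleted, so the offsets of all other entries stay valid.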

We can see that the file is composed of three different parts (the header, the key-value entries, and the terminator), which are documented in the Internal API Documentation below.

Extending mmappickle

To add support for a new memory mapped value type, one should create a new subclass of mmappickle.picklers.base.BasePickler.

This requires some knowledge of the Python internal pickle format, but should be straightforward, using the numpy picklers as inspiration. Feel free to open an issue if more details are required.
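As the internal API documentation notes, picklers are attempted in decreasing priority order. The sketch below shows one way such a selection could work. It is not mmappickle's actual code: FallbackPickler, BytesPickler and choose_pickler are hypothetical stand-ins that only mimic the priority attribute and the is_picklable() method, and the priority values are arbitrary.

```python
class FallbackPickler:
    """Hypothetical catch-all pickler with the lowest priority."""
    priority = 0

    def is_picklable(self, obj):
        return True


class BytesPickler:
    """Hypothetical specialised pickler, tried before the fallback."""
    priority = 100

    def is_picklable(self, obj):
        return isinstance(obj, (bytes, bytearray))


def choose_pickler(picklers, obj):
    # Try the picklers from highest to lowest priority.
    for pickler in sorted(picklers, key=lambda p: p.priority, reverse=True):
        if pickler.is_picklable(obj):
            return pickler
    raise TypeError('no pickler accepts {!r}'.format(type(obj)))


picklers = [FallbackPickler(), BytesPickler()]
print(type(choose_pickler(picklers, b'raw')).__name__)   # BytesPickler
print(type(choose_pickler(picklers, 'text')).__name__)   # FallbackPickler
```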

Internal API Documentation

class mmappickle.dict._header(mmapdict, _real_header_starts_at=0)[source]

The file header is at the beginning of the file.

It consists of the following pickle ops:

PROTO 4                                (pickle version 4 header)
FRAME <length>
BININT <_file_version_number:32> POP   (version of the pickle dict, 1)
BININT <_file_commit_number:32> POP    (commit id of the pickle dict, incremented every time something changes)
<additional data depending on the _file_version_number> (none, for version 1)
MARK                                   (start of the dictionary)
__init__(mmapdict, _real_header_starts_at=0)[source]
Parameters:
  • mmapdict – mmapdict object containing the data
  • _real_header_starts_at – Offset of the header (normally not used)
exists
Returns:True if file contains something
write_initial()[source]

Write the initial header to the file

is_valid()[source]
Returns:True if file has a valid mmapdict pickle header, False otherwise.
commit_number

Commit number (revision) in the file

__len__()[source]
Returns:the total length of the header.
__weakref__

list of weak references to the object (if defined)

class mmappickle.dict._terminator(mmapdict)[source]

Terminator is the suffix at the end of the mmapdict file.

It consists of the following pickle ops:

FRAME 2
DICT (make the dictionary)
STOP (end of the file)
__init__(mmapdict)[source]
Parameters:mmapdict – mmapdict object containing the data
__len__()[source]
Returns:the length of the terminator
exists
Returns:True if the file ends with the terminator, False otherwise
write()[source]

Write the terminator at the end of the file, if it doesn’t exist

__weakref__

list of weak references to the object (if defined)

class mmappickle.dict._kvdata(mmapdict, offset)[source]

kvdata is the structure holding a key-value data entry.

The trick is that it should be either two values, key and value, or nothing, if the value is deleted.

To do this, we put the key and the value on the stack. Then we either push a NEWTRUE+POP (which results in a NO-OP), or we push a POP+POP (which removes both the key and the value). Since NEWTRUE and POP both have length 1, it is easy to make the substitution.

Another trick is to cache the maximum value of the memoization index (for GET and PUT), to ensure that we have no duplicates.

The _kvdata structure has the following pickle ops:

FRAME <length>
SHORT_BINUNICODE <length> <key bytes>
<<< data >>>
BININT <max memo idx> POP (max memo index of this part)
NEWTRUE|POP POP (if NEWTRUE POP: entry is valid, else entry is deactivated.)
__init__(mmapdict, offset)[source]
Parameters:
  • mmapdict – mmapdict object containing the data
  • offset – Offset of the key-value data
__len__()[source]
Returns:the length of the key-value data
offset
Returns:the offset in the file of the key-value data
end_offset
Returns:the end-offset in the file of the key-value data
_frame_length
Returns:the frame length for this _kvdata.

This is done either by reading it in the file, or by computing it if it doesn’t exist

_exists_initial
Returns:True if the file contains the header of the frame
data_offset
Returns:the offset of the pickled data
key_length
Returns:the binary length of the key
_valid_offset
Returns:the offset of the valid byte
_memomaxidx_offset
Returns:the offset of the max memo index
data_length
Returns:the length of the pickled data
key
Returns:the key as a Unicode string
memomaxidx
Returns:the (cached) max memo index
valid
Returns:True if the key-value couple is valid, False otherwise (i.e. key was deleted)
_write_if_allowed()[source]

Write to file, if it is possible to do so

__weakref__

list of weak references to the object (if defined)

class mmappickle.picklers.base.BasePickler(parent_object)[source]

Bases: object

Picklers will be attempted in decreasing priority order

__init__(parent_object)[source]

Initialize self. See help(type(self)) for accurate signature.

is_valid(offset, length)[source]

Return True if object starting at offset in f is valid.

File position is kept.

is_picklable(obj)[source]

Return True if object can be pickled with this pickler

read(offset, length)[source]

Return the unpickled object read from offset, and the length read. The file position is kept.

write(obj, offset, memo_start_idx=0)[source]

Write the pickled object to the file stream, the file position is kept.

Returns a tuple (number of bytes, last memo index)

__weakref__

list of weak references to the object (if defined)

class mmappickle.picklers.base.GenericPickler(parent_object)[source]

Bases: mmappickle.picklers.base.BasePickler

priority

Integer priority of this pickler. Picklers are attempted in decreasing priority order.

is_valid(offset, length)[source]

Return True if object starting at offset in f is valid.

File position is kept.

is_picklable(obj)[source]

Return True if object can be pickled with this pickler

read(offset, length)[source]

Return the unpickled object read from offset, and the length read. The file position is kept.

write(obj, offset, memo_start_idx=0)[source]

Write the pickled object to the file stream, the file position is kept.

Returns a tuple (number of bytes, last memo index)

mmappickle.utils.save_file_position(f)[source]

Decorator to save the object._file stream position before calling the method
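A decorator of this kind could look like the following sketch. This is an illustration, not mmappickle's implementation: keep_file_position and Reader are hypothetical names, and it is assumed that the saved position is restored once the method returns.

```python
import functools
import io


def keep_file_position(method):
    """Run method, then put self._file back where it was (sketch only)."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        position = self._file.tell()
        try:
            return method(self, *args, **kwargs)
        finally:
            # Restore the position even if the method raised.
            self._file.seek(position)
    return wrapper


class Reader:
    """Hypothetical object holding a file stream in self._file."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    @keep_file_position
    def peek(self, offset, length):
        self._file.seek(offset)
        return self._file.read(length)


reader = Reader(b'abcdef')
reader._file.seek(2)
chunk = reader.peek(4, 2)
position = reader._file.tell()
print(chunk)     # b'ef'
print(position)  # 2
```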

mmappickle.utils.require_writable(f)[source]

Require the object’s _file to be writable, otherwise raise an exception.

mmappickle.utils.lock(f)[source]

Lock the file during the execution of this method. This is a re-entrant lock.

Contributing guide

mmappickle is free software, and all contributions are welcome, whether they are bug reports, source code, or documentation.

Reporting bugs

To report bugs, open an issue in the issue tracker.

Ideally, a bug report should contain at least the following information:

  • a minimum code example to trigger the bug
  • the expected result
  • the result obtained.

Quick guide to contributing code or documentation

To contribute, you’ll need Sphinx to build the documentation, and pytest to run the tests.

If for some reason you are not able to run the following steps, simply open an issue with your proposed change.

  1. Fork the mmappickle repository on GitHub.
  2. Clone your fork to your local machine:
git clone https://github.com/<your username>/mmappickle.git
cd mmappickle
pip install -e .
  3. Create a branch for your changes:
git checkout -b <branch-name>
  4. Make your changes.
  5. If you’re writing code, you should write some tests, and ensure that all the tests pass and that the code coverage is good. This can be done using:
py.test --cov=mmappickle --pep8
  6. You should also check that the documentation compiles and that the result looks good. The documentation can be seen by opening a browser in doc/html. You can (re)build it using the following command line (make sure that there are no warnings):
sphinx-build doc/source doc/html
  7. Commit your changes and push to your fork on GitHub:
git add .
git commit -m "<description-of-changes>"
git push origin <branch-name>
  8. Submit a pull request.