Documentation contents¶
Welcome to mmappickle’s documentation!¶
This Python 3 module makes it possible to store large structures in a Python pickle, in such a way that arrays can be memory-mapped instead of being copied into memory. This module is licensed under the LGPL3 license.
Currently, the container is a dictionary (mmappickle.mmapdict), whose keys are unicode strings of less than 256 bytes.
It supports any type of value, but it is only possible to memory map numpy.ndarray
and numpy.ma.MaskedArray
at present.
It also supports concurrent access (i.e. you can pass a mmappickle.mmapdict as an argument to a function which is called using the multiprocessing Python module).
Why use mmappickle?¶
mmappickle is designed for “unstructured” parallel access, with a strong emphasis on adding new data.
Here are two example use cases:
Timelapse hyperspectral scanning
At each time interval, a hyperspectral image (which is a numpy array of a few hundred megabytes) is added to the file, along with the related metadata.
The key features for which mmappickle is useful are:
- it is possible to have an independent viewer process, which monitors the file as it is appended to, and allows the operator to stop the capture once sufficient data has been acquired.
- all the information is kept in one file, which makes it easier to archive and distribute.
- the images are not stored in RAM, which would be a problem due to their size.
Moreover, as the file is a normal pickle, it is not required to install anything more than a standard distribution of Python to make it work. This is useful when distributing files to occasional users, who may be less familiar with Python, and may not be able to use pip to install libraries.
Image registration
When one has multiple (hyperspectral) images of the same subject, a common requirement is to register these images so that a combined image containing the information of all of them can be created. Depending on the images, this could be for example HDR imaging, or simply stitching the images together.
Commonly, a set of files is provided to the algorithm, which computes keypoints, then the transformation parameters, and finally creates an output file. For practical reasons, intermediate data is rarely kept, as it means more files, and also that the references to the original files must be strictly kept.
With mmappickle, all the input images are in one file. The registration algorithm can simply add the keypoints and the transformation parameters to the original file. This simplifies debugging and also serves as a cache when trying different combining algorithms in succession.
Quick start¶
mmappickle.mmapdict
behaves like a dictionary. For example:
>>> from mmappickle import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle') #could be an existing file
>>> m['a_sample_key'] = 'value'
>>> m['other_key'] = [1,2,3]
>>> print(m['a_sample_key'])
value
>>> m['other_key'][2]
3
>>> del m['a_sample_key']
>>> print(m.keys())
['other_key']
The contents of the dictionary are stored to disk. For example, in another python interpreter:
>>> from mmappickle import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle')
>>> print(m['other_key'])
[1, 2, 3]
It is also possible to open the file in read-only mode, in which case any modification will fail:
>>> from mmappickle import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle', True)
>>> m['other_key'] = 'a'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/laurent/git/mmappickle/mmappickle/utils.py", line 22, in require_writable_wrapper
raise io.UnsupportedOperation('not writable')
io.UnsupportedOperation: not writable
Of course, the main interest is to store numpy arrays:
>>> import numpy as np
>>> from mmappickle.dict import mmapdict
>>> m = mmapdict('/tmp/test.mmdpickle')
>>> m['test'] = np.array([1,2,3], dtype=np.uint8)
>>> m['test'][1] = 4
>>> print(m['test'])
[1 4 3]
>>> print(type(m['test']))
<class 'numpy.core.memmap.memmap'>
As you can see, m['test'] is now memory-mapped. This means that its content is not loaded in memory, but instead accessed directly from the file.
Unfortunately, the array has to exist in memory in order to be serialized to the mmapdict. If the array exceeds the available memory, this won’t work. Instead, one should use stubs:
>>> from mmappickle.stubs import EmptyNDArray
>>> m['test_large'] = EmptyNDArray((300,300,300))
>>> print(type(m['test_large']))
<class 'numpy.core.memmap.memmap'>
The array in m['test_large'] occupies 216 MB, but it was at no point allocated in RAM. This way, it is possible to allocate arrays larger than the available memory. One could have written m['test_large'] = np.empty((300,300,300)), but unfortunately the memory is allocated when calling numpy.empty().
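The 216 MB figure follows directly from the array shape, assuming 8-byte elements (float64 is the default dtype of numpy.empty(); that EmptyNDArray uses the same default is an assumption here). A quick check using only the standard library, where array_nbytes is a helper name introduced for this illustration:

```python
import math

def array_nbytes(shape, itemsize=8):
    """Size in bytes of a dense array; itemsize=8 assumes float64 elements."""
    return math.prod(shape) * itemsize

size = array_nbytes((300, 300, 300))
print(size)           # 216000000
print(size / 10**6)   # 216.0 (decimal megabytes)
```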
Finally, one last useful trick is the mmappickle.mmapdict.vacuum()
method. It allows reclaiming the disk space:
>>> del m['test_large']
>>> #Here, /tmp/test.mmdpickle still occupies ~216M of hard disk
>>> m.vacuum()
>>> #Now the disk space has been reclaimed.
Warning
When running mmappickle.mmapdict.vacuum(), it is crucial that there are no other references to the file content, either in this process or in others.
In particular, no memory-mapped array may still be in use. If this rule is not followed, the likely outcomes are crashes, data corruption, or invalid data.
Getting help¶
Please use the mmappickle issue tracker on GitHub to ask any questions.
To report bugs, please see Reporting bugs.
Installation¶
mmappickle can be installed either using pip, or from source.
Since mmappickle requires Python 3, you should ensure that you’re using the correct pip executable. On some distributions, the executable is named pip3.
It is possible to check the Python version of pip using:
pip -V
Similarly, you may need to use py.test-3 instead of py.test.
Manual installation from source¶
To install manually, run the following:
git clone https://github.com/UniNE-CHYN/mmappickle
cd mmappickle
pip install .
To contribute, use pip install -e instead of pip install. This sets a link to the source folder in the Python installation, instead of copying the files.
It is advisable to run the tests to ensure that everything is working. This can be done by running the following command in the mmappickle directory:
py.test
API Documentation¶
This is the API documentation of the public API of mmappickle. These are the most commonly used functions when using mmappickle.
Developers interested in extending mmappickle, or in understanding the inner workings should read the following page: Internals.
Main module¶
class mmappickle.mmapdict(file, readonly=None, picklers=None)
Class to access a mmap-able dictionary in a file.
This class is safe to use in a multi-process environment.
__init__(file, readonly=None, picklers=None)
Create or load a mmap dictionary.
Parameters:
- file – either a file-like object or a string representing the name of the file.
- readonly – if file is a string, the file will be opened in read-only mode if set to True.
- picklers – explicit list of picklers. Usually this is not needed (by default, all are used).
writable
True if the file is writable, False otherwise.
commit_number
The monotonically increasing commit number of the mmapdict.
This is useful to know if the keys have been changed by another process. If the commit_number hasn’t changed, it is guaranteed that keys() won’t have changed.
Although it is possible to set the commit number using this property, there is generally no use for this in external code.
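As an illustration of how a monitoring process (such as the viewer in the timelapse use case) might use commit_number, here is a minimal polling sketch. It only relies on the documented commit_number property and keys() method. new_keys_since and FakeDict are names made up for this example; FakeDict is a hypothetical in-memory stand-in so that the snippet runs without mmappickle installed.

```python
def new_keys_since(m, last_commit, known_keys):
    """Return (commit_number, newly_seen_keys) for a dict-like object m.

    The key list is only re-read when the commit number has changed.
    The commit number is read *before* the keys, so a concurrent change
    is at worst picked up again on the next poll.
    """
    current = m.commit_number
    if current == last_commit:
        return current, set()
    return current, set(m.keys()) - set(known_keys)


class FakeDict:
    """Hypothetical stand-in exposing commit_number and keys() like mmapdict."""

    def __init__(self):
        self.commit_number = 0
        self._data = {}

    def keys(self):
        return list(self._data)

    def __setitem__(self, key, value):
        self._data[key] = value
        self.commit_number += 1


d = FakeDict()
d["frame_0"] = "image data"
commit, fresh = new_keys_since(d, 0, set())
print(commit, fresh)   # 1 {'frame_0'}
```

With a real mmapdict, the same function could be called in a loop with a short sleep between polls.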
__contains__(k)
Check if a key exists in the dictionary.
Parameters: k – key (string) to check for existence
Returns: True if the key exists in the dictionary, False otherwise.
__weakref__
List of weak references to the object (if defined)
__setitem__(k, v)
Create or change key k, setting its value to v.
Parameters:
- k – key, should be a unicode string of binary length <= 255.
- v – value, any picklable object
When replacing a value, this function adds the new key-value pair at the end of the file, and marks the old one as invalid, but leaves the data in place. As a consequence, this function can be used when using the file concurrently from multiple processes. However, other processes may still be using the old value if they don’t reload the value from the file.
If no concurrent access exists to the file, the old value can be freed using vacuum().
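The limit on keys is expressed in bytes, not characters, so non-ASCII keys reach it sooner. The following sketch checks a candidate key before inserting it; is_valid_key is a hypothetical helper, not part of mmappickle, and it assumes UTF-8 encoding because the pickle SHORT_BINUNICODE opcode stores UTF-8 data behind a one-byte length.

```python
def is_valid_key(key, max_bytes=255):
    """True if key is a str whose UTF-8 encoding fits in max_bytes bytes."""
    return isinstance(key, str) and len(key.encode("utf-8")) <= max_bytes


print(is_valid_key("a" * 255))   # True  (255 bytes)
print(is_valid_key("é" * 128))   # False (128 characters, but 256 bytes)
```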
__getitem__(k)
Get value for key k, raise KeyError if the key doesn’t exist in the file.
If possible, the data will be returned as a mmap’ed object.
__delitem__(k)
Mark key k as not valid in the file.
Parameters: k – key to remove
This method marks the key as invalid, but leaves the data in place. As a consequence, this function can be used when using the file concurrently from multiple processes. However, other processes may still be using the value if they don’t reload the keys from the file.
If no concurrent access exists to the file, the old value can be freed using vacuum().
vacuum(chunk_size=1048576)
Free all deleted keys, effectively reclaiming disk space.
Only use this function when no mmap exists on the file. Usually it is safer to run it only in parts of the code where there is no concurrent access.
Parameters: chunk_size – the size of the buffer used to shift data in the file.
Warning
No mmap should exist on this file (both in this Python script, and in others), as the data will be shifted.
If an mmap exists, it could crash the process and/or corrupt the file and/or return invalid data.
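To make the role of chunk_size concrete, here is a generic sketch of removing a byte range from a file by shifting the following data towards the start, one buffer at a time. This is not the library's implementation, only an illustration of the technique under that assumption; remove_range is a name introduced here. It also shows why existing memory maps become invalid: every byte after the removed range changes offset.

```python
import io

def remove_range(f, start, end, chunk_size=1048576):
    """Remove bytes [start, end) from seekable binary file f, in place."""
    f.seek(0, io.SEEK_END)
    size = f.tell()
    read_pos, write_pos = end, start
    while read_pos < size:
        f.seek(read_pos)
        chunk = f.read(min(chunk_size, size - read_pos))
        if not chunk:
            break  # unexpected end of file
        f.seek(write_pos)
        f.write(chunk)
        read_pos += len(chunk)
        write_pos += len(chunk)
    f.truncate(write_pos)

buf = io.BytesIO(b"0123456789")
remove_range(buf, 2, 5, chunk_size=2)
print(buf.getvalue())   # b'0156789'
```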
fsck()
Attempt to fix the file, if possible.
This function should be called if some data could not be written to the file. This might be the case if, for example, not enough disk space was available.
This method truncates the file and recreates a valid terminator.
Warning
Calling this function may lead to data loss.
-
Stubs¶
Internals¶
mmapdict files are pickle files, containing a dictionary, but with a special format. The main idea is to have a file with a predictable structure, so that the offsets for the memory maps can be computed. Moreover, a way to disable a specific key is required, either to replace it or to delete it without changing the offsets in the file.
For example, for the following dictionary:
{'key': 'value', 'test': array([1, 2, 3], dtype=uint8)}
The normal pickle module would output:
0: \x80 PROTO 4
2: \x95 FRAME 172
11: } EMPTY_DICT
12: \x94 MEMOIZE
13: ( MARK
14: \x8c SHORT_BINUNICODE 'test'
20: \x94 MEMOIZE
21: \x8c SHORT_BINUNICODE 'numpy.core.multiarray'
44: \x94 MEMOIZE
45: \x8c SHORT_BINUNICODE '_reconstruct'
59: \x94 MEMOIZE
60: \x93 STACK_GLOBAL
61: \x94 MEMOIZE
62: \x8c SHORT_BINUNICODE 'numpy'
69: \x94 MEMOIZE
70: \x8c SHORT_BINUNICODE 'ndarray'
79: \x94 MEMOIZE
80: \x93 STACK_GLOBAL
81: \x94 MEMOIZE
82: K BININT1 0
84: \x85 TUPLE1
85: \x94 MEMOIZE
86: C SHORT_BINBYTES b'b'
89: \x94 MEMOIZE
90: \x87 TUPLE3
91: \x94 MEMOIZE
92: R REDUCE
93: \x94 MEMOIZE
94: ( MARK
95: K BININT1 1
97: K BININT1 3
99: \x85 TUPLE1
100: \x94 MEMOIZE
101: \x8c SHORT_BINUNICODE 'numpy'
108: \x94 MEMOIZE
109: \x8c SHORT_BINUNICODE 'dtype'
116: \x94 MEMOIZE
117: \x93 STACK_GLOBAL
118: \x94 MEMOIZE
119: \x8c SHORT_BINUNICODE 'u1'
123: \x94 MEMOIZE
124: K BININT1 0
126: K BININT1 1
128: \x87 TUPLE3
129: \x94 MEMOIZE
130: R REDUCE
131: \x94 MEMOIZE
132: ( MARK
133: K BININT1 3
135: \x8c SHORT_BINUNICODE '|'
138: \x94 MEMOIZE
139: N NONE
140: N NONE
141: N NONE
142: J BININT -1
147: J BININT -1
152: K BININT1 0
154: t TUPLE (MARK at 132)
155: \x94 MEMOIZE
156: b BUILD
157: \x89 NEWFALSE
158: C SHORT_BINBYTES b'\x01\x02\x03'
163: \x94 MEMOIZE
164: t TUPLE (MARK at 94)
165: \x94 MEMOIZE
166: b BUILD
167: \x8c SHORT_BINUNICODE 'key'
172: \x94 MEMOIZE
173: \x8c SHORT_BINUNICODE 'value'
180: \x94 MEMOIZE
181: u SETITEMS (MARK at 13)
182: . STOP
highest protocol among opcodes = 4
This works fine, but doesn’t allow random access.
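A listing in this layout can be produced with the standard pickletools module. The sketch below uses a plain list in place of the numpy array so that it runs without numpy; the exact opcodes therefore differ from the listing above. It also collects the byte position of every opcode, which shows that the position of a value depends on everything pickled before it.

```python
import io
import pickle
import pickletools

data = pickle.dumps({"key": "value", "test": [1, 2, 3]}, protocol=4)

# Human-readable disassembly, written to a string buffer and printed.
out = io.StringIO()
pickletools.dis(data, out=out)
listing = out.getvalue()
print(listing)

# (opcode name, byte position) pairs, in file order.
ops = [(op.name, pos) for op, arg, pos in pickletools.genops(data)]
print(ops[0])    # ('PROTO', 0)
```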
Let’s look at what a mmappickle.dict
file looks like, for the same data:
0: \x80 PROTO 4
2: \x95 FRAME 13
11: J BININT 1
16: 0 POP
17: J BININT 2
22: 0 POP
23: ( MARK
24: \x95 FRAME 20
33: \x8c SHORT_BINUNICODE 'key'
38: \x8c SHORT_BINUNICODE 'value'
45: J BININT 1
50: 0 POP
51: \x88 NEWTRUE
52: 0 POP
53: \x95 FRAME 110
62: \x8c SHORT_BINUNICODE 'test'
68: \x8c SHORT_BINUNICODE 'numpy.core.fromnumeric'
92: \x8c SHORT_BINUNICODE 'reshape'
101: \x93 STACK_GLOBAL
102: \x8c SHORT_BINUNICODE 'numpy.core.multiarray'
125: \x8c SHORT_BINUNICODE 'fromstring'
137: \x93 STACK_GLOBAL
138: \x8e BINBYTES8 b'\x01\x02\x03'
150: \x8c SHORT_BINUNICODE 'uint8'
157: \x86 TUPLE2
158: R REDUCE
159: K BININT1 3
161: \x85 TUPLE1
162: \x86 TUPLE2
163: R REDUCE
164: J BININT 0
169: 0 POP
170: \x88 NEWTRUE
171: 0 POP
172: \x95 FRAME 2
181: d DICT (MARK at 23)
182: . STOP
highest protocol among opcodes = 4
We can note the following changes:
- There are hidden values at the beginning (version = 1, file revision = 2)
- Each key-value couple is in an individual frame, which also contains a hidden int (memo max index) and finally a hidden TRUE.
- The numpy array is created using numpy.core.fromnumeric.reshape(numpy.core.multiarray.fromstring(data, dtype), shape) instead of the “traditional” way
The version field is used to allow further developments, and is fixed to 1 at present.
The file revision is increased each time a key of the dictionary is changed, to allow caching when there is concurrent access.
Memo max index is used because there may be MEMOIZE/GET/PUT to renumber when pickling values. This is a cache to avoid having to parse all the file.
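To illustrate what a “max memo index” is, the following sketch scans an ordinary pickle with pickletools and computes the highest memo slot it uses. This is an illustration only, not the code mmappickle uses; max_memo_index is a name introduced here. MEMOIZE stores into the next free slot, while the PUT family names the slot explicitly.

```python
import pickle
import pickletools

def max_memo_index(data):
    """Highest memo index written by the pickle, or -1 if the memo is unused."""
    memo = set()
    for op, arg, _pos in pickletools.genops(data):
        if op.name == "MEMOIZE":
            memo.add(len(memo))          # implicit index: current memo size
        elif op.name in ("PUT", "BINPUT", "LONG_BINPUT"):
            memo.add(arg)                # explicit index
    return max(memo, default=-1)

print(max_memo_index(pickle.dumps(["a", "b"], protocol=4)))   # 2
print(max_memo_index(pickle.dumps(1, protocol=4)))            # -1
```

Caching this number next to each value means that a new value can be appended with memo indices starting after it, without re-parsing the existing entries.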
Finally, the hidden TRUE is a “hack” to allow removing a key. In fact, it is not possible to move data when it’s memmap’ed. To avoid this, the first TRUE is replaced by a POP when deleting the key. In summary, the stack is working in the following way:
- Key exists: KEY, VALUE, memo max index, POP, TRUE, POP. (reduced as KEY, VALUE)
- Key doesn’t exist: KEY, VALUE, memo max index, POP, POP, POP. (disappears when reduced)
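The two cases can be tried out by assembling the opcodes by hand and loading the result with the standard pickle module. This is a simplified sketch: keys and values are short strings, FRAME opcodes are left out (the unpickler does not require them), and the helper names are made up for the example.

```python
import pickle
import struct

POP, NEWTRUE, MARK, DICT, STOP = b"0", b"\x88", b"(", b"d", b"."

def binint(n):
    return b"J" + struct.pack("<i", n)           # BININT, 4-byte little-endian

def short_binunicode(text):
    raw = text.encode("utf-8")
    return b"\x8c" + bytes([len(raw)]) + raw     # SHORT_BINUNICODE

def entry(key, value, valid=True, memo_max_idx=0):
    # KEY, VALUE, memo max index, POP, then TRUE (valid) or POP (deleted), POP
    flag = NEWTRUE if valid else POP
    return (short_binunicode(key) + short_binunicode(value)
            + binint(memo_max_idx) + POP + flag + POP)

def build(entries):
    # PROTO 4, hidden version (1) and revision (2), MARK, entries, DICT, STOP
    header = b"\x80\x04" + binint(1) + POP + binint(2) + POP + MARK
    return header + b"".join(entries) + DICT + STOP

alive = build([entry("key", "value"), entry("other", "x")])
deleted = build([entry("key", "value", valid=False), entry("other", "x")])

print(pickle.loads(alive))           # {'key': 'value', 'other': 'x'}
print(pickle.loads(deleted))         # {'other': 'x'}
print(len(alive) == len(deleted))    # True: deleting moves no data
```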
We can see that the file is composed of three different parts, which are documented below:
- The header (mmappickle.dict._header)
- Storage of each key-value couple (mmappickle.dict._kvdata)
- A terminator (mmappickle.dict._terminator)
Extending mmappickle¶
To add support for a new memory mapped value type, one should create a new subclass of mmappickle.picklers.base.BasePickler.
This requires some knowledge of the Python internal pickle format, but should be straightforward, using the numpy picklers as inspiration. Feel free to open an issue if more details are required.
Internal API Documentation¶
class mmappickle.dict._header(mmapdict, _real_header_starts_at=0)
The file header is at the beginning of the file.
It consists of the following pickle ops:
PROTO 4 (pickle version 4 header)
FRAME <length>
BININT <_file_version_number:32> POP (version of the pickle dict, 1)
BININT <_file_commit_number:32> POP (commit id of the pickle dict, incremented every time something changes)
<additional data depending on the _file_version_number> (none, for version 1)
MARK (start of the dictionary)
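Because every opcode in a version 1 header has a fixed size, the version and commit number sit at fixed byte offsets (12 and 18, matching the disassembly shown earlier). The sketch below builds and parses such a header with struct; build_header and parse_header are illustrative names, not part of the mmappickle API.

```python
import struct

def build_header(version=1, commit_number=0):
    body = (b"J" + struct.pack("<i", version) + b"0"            # BININT, POP
            + b"J" + struct.pack("<i", commit_number) + b"0"    # BININT, POP
            + b"(")                                             # MARK
    # PROTO 4, then FRAME with an 8-byte little-endian length
    return b"\x80\x04" + b"\x95" + struct.pack("<Q", len(body)) + body

def parse_header(data):
    """Return (version, commit_number) from a version 1 style header."""
    if data[:3] != b"\x80\x04\x95":
        raise ValueError("not a protocol 4 pickle starting with a frame")
    if data[11:12] != b"J" or data[16:18] != b"0J" or data[22:24] != b"0(":
        raise ValueError("unexpected header layout")
    (version,) = struct.unpack_from("<i", data, 12)
    (commit_number,) = struct.unpack_from("<i", data, 18)
    return version, commit_number

header = build_header(version=1, commit_number=2)
print(len(header))            # 24
print(parse_header(header))   # (1, 2)
```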
__init__(mmapdict, _real_header_starts_at=0)
Parameters:
- mmapdict – mmapdict object containing the data
- _real_header_starts_at – offset of the header (normally not used)
exists
Returns: True if file contains something
commit_number
Commit number (revision) in the file
__weakref__
List of weak references to the object (if defined)
class mmappickle.dict._terminator(mmapdict)
Terminator is the suffix at the end of the mmapdict file.
It consists of the following pickle ops:
FRAME 2
DICT (make the dictionary)
STOP (end of the file)
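The terminator is therefore a fixed 11-byte suffix. As a small sketch (names are illustrative, and the header bytes follow the _header layout with version 1 and commit number 0), a header immediately followed by the terminator is a valid pickle of an empty dictionary:

```python
import pickle
import struct

# FRAME 2, DICT, STOP
TERMINATOR = b"\x95" + struct.pack("<Q", 2) + b"d" + b"."

def ends_with_terminator(data):
    return data.endswith(TERMINATOR)

# Header: PROTO 4, FRAME, BININT 1, POP, BININT 0, POP, MARK
header_body = (b"J" + struct.pack("<i", 1) + b"0"
               + b"J" + struct.pack("<i", 0) + b"0" + b"(")
header = b"\x80\x04" + b"\x95" + struct.pack("<Q", len(header_body)) + header_body

empty_file = header + TERMINATOR
print(len(TERMINATOR))                    # 11
print(ends_with_terminator(empty_file))   # True
print(pickle.loads(empty_file))           # {}
```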
exists
Returns: True if the file ends with the terminator, False otherwise
__weakref__
List of weak references to the object (if defined)
class mmappickle.dict._kvdata(mmapdict, offset)
kvdata is the structure holding a key-value data entry.
The trick is that it should be either two values, key and value, or nothing, if the value is deleted.
To do this, we put the key and the value on the stack. Then we either push a NEWTRUE+POP (which results in a NO-OP), or we push a POP+POP (which removes both the key and the value). Since NEWTRUE and POP both have length 1, it is easy to make the substitution.
Another trick is to cache the maximum value of the memoization index (for GET and PUT), to ensure that we have no duplicates.
The _kvdata structure has the following pickle ops:
FRAME <length>
SHORT_BINUNICODE <length> <key bytes>
<<< data >>>
BININT <max memo idx> POP (max memo index of this part)
NEWTRUE|POP POP (if NEWTRUE POP: entry is valid, else entry is deactivated.)
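This layout makes it possible to list the keys of a file, and whether each is still valid, by hopping from frame to frame without unpickling any value. The following sketch illustrates the idea on hand-built bytes; it is not the library's code, the helper names are made up, and values are restricted to short strings for brevity.

```python
import pickle
import struct

def frame(body):
    return b"\x95" + struct.pack("<Q", len(body)) + body   # FRAME <length>

def kv_entry(key, value, valid=True, memo_max_idx=0):
    raw_key, raw_val = key.encode("utf-8"), value.encode("utf-8")
    return frame(
        b"\x8c" + bytes([len(raw_key)]) + raw_key            # SHORT_BINUNICODE key
        + b"\x8c" + bytes([len(raw_val)]) + raw_val          # data (short string)
        + b"J" + struct.pack("<i", memo_max_idx) + b"0"      # BININT, POP
        + (b"\x88" if valid else b"0") + b"0"                # NEWTRUE|POP, POP
    )

HEADER = b"\x80\x04" + frame(b"J" + struct.pack("<i", 1) + b"0"
                             + b"J" + struct.pack("<i", 0) + b"0" + b"(")
TERMINATOR = frame(b"d.")

def iter_entries(data, offset=len(HEADER)):
    """Yield (offset, key, valid) per entry, reading only frame headers and keys."""
    stop = len(data) - len(TERMINATOR)
    while offset < stop:
        if data[offset:offset + 1] != b"\x95":
            raise ValueError("expected FRAME at offset %d" % offset)
        (length,) = struct.unpack_from("<Q", data, offset + 1)
        body_start = offset + 9
        end = body_start + length
        key_len = data[body_start + 1]
        key = data[body_start + 2:body_start + 2 + key_len].decode("utf-8")
        valid = data[end - 2:end - 1] == b"\x88"              # the "valid byte"
        yield offset, key, valid
        offset = end

data = (HEADER + kv_entry("a", "1") + kv_entry("b", "2", valid=False)
        + kv_entry("c", "3") + TERMINATOR)
print([(key, valid) for _, key, valid in iter_entries(data)])
# [('a', True), ('b', False), ('c', True)]
print(pickle.loads(data))   # {'a': '1', 'c': '3'}
```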
__init__(mmapdict, offset)
Parameters:
- mmapdict – mmapdict object containing the data
- offset – offset of the key-value data
offset
Returns: the offset in the file of the key-value data
end_offset
Returns: the end-offset in the file of the key-value data
_frame_length
Returns: the frame length for this _kvdata. This is done either by reading it in the file, or by computing it if it doesn’t exist
_exists_initial
Returns: True if the file contains the header of the frame
data_offset
Returns: the offset of the pickled data
key_length
Returns: the binary length of the key
_valid_offset
Returns: the offset of the valid byte
_memomaxidx_offset
Returns: the offset of the max memo index
data_length
Returns: the length of the pickled data
key
Returns: the key as a unicode string
memomaxidx
Returns: the (cached) max memo index
valid
Returns: True if the key-value couple is valid, False otherwise (i.e. key was deleted)
__weakref__
List of weak references to the object (if defined)
class mmappickle.picklers.base.BasePickler(parent_object)
Bases: object
Picklers will be attempted in decreasing priority order
is_valid(offset, length)
Return True if object starting at offset in f is valid.
File position is kept.
read(offset, length)
Return the unpickled object read from offset, and the length read. The file position is kept.
write(obj, offset, memo_start_idx=0)
Write the pickled object to the file stream, the file position is kept.
Returns a tuple (number of bytes, last memo index)
__weakref__
List of weak references to the object (if defined)
class mmappickle.picklers.base.GenericPickler(parent_object)
Bases: mmappickle.picklers.base.BasePickler
priority
Integer priority of this pickler (picklers are attempted in decreasing priority order).
is_valid(offset, length)
Return True if object starting at offset in f is valid.
File position is kept.
mmappickle.utils.save_file_position(f)
Decorator to save the object._file stream position before calling the method
Contributing guide¶
mmappickle is free software, and all contributions are welcome, whether they are bug reports, source code, or documentation.
Reporting bugs¶
To report bugs, open an issue in the issue tracker.
Ideally, a bug report should contain at least the following information:
- a minimal code example to trigger the bug
- the expected result
- the result obtained.
Quick guide to contributing code or documentation¶
To contribute, you’ll need Sphinx to build the documentation, and pytest to run the tests.
If for some reason you are not able to run the following steps, simply open an issue with your proposed change.
- Fork the mmappickle repository on GitHub.
- Clone your fork to your local machine:
git clone https://github.com/<your username>/mmappickle.git
cd mmappickle
pip install -e .
- Create a branch for your changes:
git checkout -b <branch-name>
- Make your changes.
- If you’re writing code, you should write some tests, ensure that all the tests pass and that the code coverage is good. This can be done using:
py.test --cov=mmappickle --pep8
- You should also check that the documentation compiles and that the result looks good. The documentation can be seen by opening a browser in doc/html. You can (re)build it using the following command line (make sure that there are no warnings):
sphinx-build doc/source doc/html
- Commit your changes and push to your fork on GitHub:
git add .
git commit -m "<description-of-changes>"
git push origin <branch-name>
- Submit a pull request.