Welcome to pandas-msgpack’s documentation!
The pandas_msgpack module provides an interface from pandas (https://pandas.pydata.org) to the msgpack library. msgpack is a lightweight, portable binary format, similar to binary JSON, that is highly space efficient and performs well on both writing (serialization) and reading (deserialization).
Installation
You can install pandas-msgpack with conda, pip, or by installing from source.
Conda
$ conda install pandas-msgpack --channel conda-forge
This installs pandas-msgpack and all common dependencies, including pandas.
Pip
To install the latest version of pandas-msgpack:
$ pip install pandas-msgpack -U
This installs pandas-msgpack and all common dependencies, including pandas.
Install from Source
$ pip install git+https://github.com/pydata/pandas-msgpack.git
Tutorial
In [1]: import numpy as np; import pandas as pd
In [2]: from pandas_msgpack import to_msgpack, read_msgpack
In [3]: df = pd.DataFrame(np.random.rand(5,2), columns=list('AB'))
In [4]: to_msgpack('foo.msg', df)
In [5]: read_msgpack('foo.msg')
Out[5]:
          A         B
0  0.412073  0.117020
1  0.331685  0.341557
2  0.905732  0.131801
3  0.086333  0.710444
4  0.544546  0.980821
In [6]: s = pd.Series(np.random.rand(5),index=pd.date_range('20130101',periods=5))
You can pass several objects at once; they are returned as a list on deserialization.
In [7]: to_msgpack('foo.msg', df, 'foo', np.array([1,2,3]), s)
In [8]: read_msgpack('foo.msg')
Out[8]:
[          A         B
 0  0.412073  0.117020
 1  0.331685  0.341557
 2  0.905732  0.131801
 3  0.086333  0.710444
 4  0.544546  0.980821, 'foo', array([1, 2, 3]), 2013-01-01    0.488041
 2013-01-02    0.504900
 2013-01-03    0.102942
 2013-01-04    0.999584
 2013-01-05    0.598648
 Freq: D, dtype: float64]
You can pass iterator=True to iterate over the unpacked results.
In [9]: for o in read_msgpack('foo.msg', iterator=True):
   ...:     print(o)
   ...:
          A         B
0  0.412073  0.117020
1  0.331685  0.341557
2  0.905732  0.131801
3  0.086333  0.710444
4  0.544546  0.980821
foo
[1 2 3]
2013-01-01    0.488041
2013-01-02    0.504900
2013-01-03    0.102942
2013-01-04    0.999584
2013-01-05    0.598648
Freq: D, dtype: float64
You can pass append=True to the writer to append to an existing pack.
In [10]: to_msgpack('foo.msg', df, append=True)
In [11]: read_msgpack('foo.msg')
Out[11]:
[          A         B
 0  0.412073  0.117020
 1  0.331685  0.341557
 2  0.905732  0.131801
 3  0.086333  0.710444
 4  0.544546  0.980821, 'foo', array([1, 2, 3]), 2013-01-01    0.488041
 2013-01-02    0.504900
 2013-01-03    0.102942
 2013-01-04    0.999584
 2013-01-05    0.598648
 Freq: D, dtype: float64,           A         B
 0  0.412073  0.117020
 1  0.331685  0.341557
 2  0.905732  0.131801
 3  0.086333  0.710444
 4  0.544546  0.980821]
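Appending and iterating both rely on msgpack messages being self-delimiting: each message records its own length, so a reader can tell where one ends and the next begins. The sketch below is a toy illustration that uses only the standard library and is not how pandas-msgpack is implemented. It assumes every message is a short string in msgpack's fixstr layout, and the helper names pack_str, append_message and iter_messages are hypothetical.

```python
import os
import tempfile


def pack_str(text):
    """Encode a short string as a msgpack fixstr (at most 31 UTF-8 bytes)."""
    raw = text.encode("utf-8")
    if len(raw) > 31:
        raise ValueError("this toy encoder only handles fixstr payloads")
    # fixstr header: bit pattern 101xxxxx, where xxxxx is the byte length
    return bytes([0xA0 | len(raw)]) + raw


def append_message(path, text):
    """Append one packed message to the end of the file (like append=True)."""
    with open(path, "ab") as fh:
        fh.write(pack_str(text))


def iter_messages(path):
    """Yield messages one at a time (like iterator=True)."""
    with open(path, "rb") as fh:
        while True:
            head = fh.read(1)
            if not head:
                return  # end of file
            if head[0] & 0xE0 != 0xA0:
                raise ValueError("not a fixstr header")
            length = head[0] & 0x1F
            yield fh.read(length).decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.msg")
    append_message(path, "foo")
    append_message(path, "bar")  # a second write simply extends the file
    messages = list(iter_messages(path))

print(messages)
```

Because the reader is a generator, only one message is decoded at a time, which is the usual reason to prefer iteration over loading everything into a list.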
Furthermore, you can pass in arbitrary Python objects.
In [12]: to_msgpack('foo2.msg', { 'dict' : [ { 'df' : df }, { 'string' : 'foo' }, { 'scalar' : 1. }, { 's' : s } ] })
In [13]: read_msgpack('foo2.msg')
Out[13]:
{'dict': ({'df':           A         B
   0  0.412073  0.117020
   1  0.331685  0.341557
   2  0.905732  0.131801
   3  0.086333  0.710444
   4  0.544546  0.980821},
  {'string': 'foo'},
  {'scalar': 1.0},
  {'s': 2013-01-01    0.488041
   2013-01-02    0.504900
   2013-01-03    0.102942
   2013-01-04    0.999584
   2013-01-05    0.598648
   Freq: D, dtype: float64})}
Compression
Optionally, the compress argument compresses the resulting bytes. The available compressors are zlib and blosc. Compression generally increases the writing time.
In [1]: import numpy as np; import pandas as pd
In [2]: from pandas_msgpack import to_msgpack, read_msgpack
In [3]: df = pd.DataFrame({'A': np.arange(100000),
   ...:                    'B': np.random.randn(100000),
   ...:                    'C': 'foo'})
   ...:
In [4]: %timeit -n 1 -r 1 to_msgpack('uncompressed.msg', df)
1 loop, best of 1: 26.9 ms per loop
In [5]: %timeit -n 1 -r 1 to_msgpack('compressed_blosc.msg', df, compress='blosc')
1 loop, best of 1: 27.2 ms per loop
In [6]: %timeit -n 1 -r 1 to_msgpack('compressed_zlib.msg', df, compress='zlib')
1 loop, best of 1: 135 ms per loop
If the data is compressed, this is inferred automatically and the data is decompressed upon reading.
In [7]: %timeit -n 1 -r 1 read_msgpack('uncompressed.msg')
1 loop, best of 1: 21.3 ms per loop
In [8]: %timeit -n 1 -r 1 read_msgpack('compressed_blosc.msg')
1 loop, best of 1: 21.1 ms per loop
In [9]: %timeit -n 1 -r 1 read_msgpack('compressed_zlib.msg')
1 loop, best of 1: 29.4 ms per loop
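As a rough illustration of the trade-off, the sketch below calls Python's standard zlib module directly rather than going through pandas-msgpack. The payload is an assumption chosen to resemble a column of repeated 'foo' values; real DataFrame bytes will compress differently.

```python
import zlib

# Highly repetitive payload: the msgpack fixstr for 'foo', repeated.
payload = b"\xa3foo" * 100000

compressed = zlib.compress(payload)      # extra work at write time
restored = zlib.decompress(compressed)   # extra work at read time

print(len(payload), len(compressed))
```

The round trip returns the original bytes unchanged; what compression costs is the CPU time spent in compress and decompress.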
Compression reduces the size of the file on disk, here from about 2.0 MB to about 1.2 MB (blosc) and 1.3 MB (zlib).
In [10]: !ls -ltr *.msg
-rw-r--r-- 1 docs docs 2000582 Apr 1 15:36 uncompressed.msg
-rw-r--r-- 1 docs docs 1187916 Apr 1 15:36 compressed_blosc.msg
-rw-r--r-- 1 docs docs 1320539 Apr 1 15:36 compressed_zlib.msg
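The relative savings can be worked out from the byte counts in the listing above. This is plain arithmetic on those three numbers, with no dependency on pandas-msgpack.

```python
# File sizes in bytes, copied from the ls output above.
sizes = {
    "uncompressed.msg": 2000582,
    "compressed_blosc.msg": 1187916,
    "compressed_zlib.msg": 1320539,
}

baseline = sizes["uncompressed.msg"]
savings = {
    name: 100 * (1 - size / baseline)
    for name, size in sizes.items()
    if name != "uncompressed.msg"
}

for name, pct in savings.items():
    print(f"{name}: {pct:.1f}% smaller")
```

For this particular DataFrame, blosc saves roughly 41% and zlib roughly 34%; other data will give different ratios.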
Read/Write API
Msgpack data can also be read from and written to byte strings: passing None as path_or_buf makes to_msgpack return the serialized bytes.
In [1]: import numpy as np; import pandas as pd
In [2]: from pandas_msgpack import to_msgpack, read_msgpack
In [3]: df = pd.DataFrame({'A': np.arange(10),
   ...:                    'B': np.random.randn(10),
   ...:                    'C': 'foo'})
   ...:
In [4]: to_msgpack(None, df)
Out[4]: b"\x84\xa4axes\x92\x86\xa5dtype\xa6object\xa5klass\xa5Index\xa4data\x93\xa1A\xa1B\xa1C\xa8compress\xc0\xa4name\xc0\xa3typ\xa5index\x86\xa4stop\n\xa5klass\xaaRangeIndex\xa4name\xc0\xa5start\x00\xa3typ\xabrange_index\xa4step\x01\xa5klass\xa9DataFrame\xa6blocks\x93\x86\xa5dtype\xa7float64\xa5klass\xaaFloatBlock\xa6values\xc7P\x00\x7f\xa2\xb4\xacXu\xd5\xbfs\xf1$\xc8\x03\xa8\xcd?\x1a\xaa\xc0\x1a\x8fw\xfb\xbf\xaa\xf9\xcd\r/z\x90\xbf\x12\xea\x0e\x8a7\xa7\xe1?\xb9\xfb{\xa2YM\xf5?sDY\xc1\xcbd\xd8?|\xd1P [u\xa5?\x8d'\xd3u=\xc6\xc6\xbf\xaa;\xa4\xe7U\xa3\xd0?\xa5shape\x92\x01\n\xa8compress\xc0\xa4locs\x86\xa5dtype\xa5int64\xa3typ\xa7ndarray\xa5shape\x91\x01\xa4data\xd7\x00\x01\x00\x00\x00\x00\x00\x00\x00\xa8compress\xc0\xa4ndim\x01\x86\xa5dtype\xa5int64\xa5klass\xa8IntBlock\xa6values\xc7P\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x06\x00\x00\x00\x00\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\xa5shape\x92\x01\n\xa8compress\xc0\xa4locs\x86\xa5dtype\xa5int64\xa3typ\xa7ndarray\xa5shape\x91\x01\xa4data\xd7\x00\x00\x00\x00\x00\x00\x00\x00\x00\xa8compress\xc0\xa4ndim\x01\x86\xa5dtype\xa6object\xa5klass\xabObjectBlock\xa6values\x9a\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa3foo\xa5shape\x92\x01\n\xa8compress\xc0\xa4locs\x86\xa5dtype\xa5int64\xa3typ\xa7ndarray\xa5shape\x91\x01\xa4data\xd7\x00\x02\x00\x00\x00\x00\x00\x00\x00\xa8compress\xc0\xa4ndim\x01\xa3typ\xadblock_manager"
Furthermore, you can concatenate the byte strings; reading the result returns a list of the original objects.
In [5]: read_msgpack(to_msgpack(None, df) + to_msgpack(None, df.A))
Out[5]:
[   A         B    C
 0  0 -0.335287  foo
 1  1  0.231690  foo
 2  2 -1.716689  foo
 3  3 -0.016091  foo
 4  4  0.551662  foo
 5  5  1.331384  foo
 6  6  0.381152  foo
 7  7  0.041911  foo
 8  8 -0.177925  foo
 9  9  0.259969  foo, 0    0
 1    1
 2    2
 3    3
 4    4
 5    5
 6    6
 7    7
 8    8
 9    9
 Name: A, dtype: int64]
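Concatenation works because every msgpack value starts with a tag byte that says what it is and how long it is. The sketch below is a minimal decoder for a small subset of the format (positive fixint, fixmap, fixarray, fixstr and nil), written for illustration only; pandas-msgpack relies on a full msgpack implementation. The function names unpack and unpack_all are hypothetical. The axis bytes are copied from the Out[4] output earlier in this section; the two trailing messages are hand-built.

```python
def unpack(buf, pos=0):
    """Decode one value from buf at pos; return (value, next_pos)."""
    tag = buf[pos]
    pos += 1
    if tag <= 0x7F:                       # positive fixint: the tag is the value
        return tag, pos
    if tag <= 0x8F:                       # fixmap: low 4 bits = number of pairs
        result = {}
        for _ in range(tag & 0x0F):
            key, pos = unpack(buf, pos)
            value, pos = unpack(buf, pos)
            result[key] = value
        return result, pos
    if tag <= 0x9F:                       # fixarray: low 4 bits = element count
        items = []
        for _ in range(tag & 0x0F):
            item, pos = unpack(buf, pos)
            items.append(item)
        return items, pos
    if tag <= 0xBF:                       # fixstr: low 5 bits = byte length
        end = pos + (tag & 0x1F)
        return buf[pos:end].decode("utf-8"), end
    if tag == 0xC0:                       # nil
        return None, pos
    raise ValueError(f"tag 0x{tag:02x} is outside this sketch's subset")


def unpack_all(buf):
    """Decode back-to-back messages until the buffer is exhausted."""
    pos, values = 0, []
    while pos < len(buf):
        value, pos = unpack(buf, pos)
        values.append(value)
    return values


# Column-axis description taken from the serialized DataFrame in Out[4].
axis = (b"\x86\xa5dtype\xa6object\xa5klass\xa5Index\xa4data\x93\xa1A\xa1B\xa1C"
        b"\xa8compress\xc0\xa4name\xc0\xa3typ\xa5index")

# Three independent messages, simply concatenated.
stream = axis + b"\xa3foo" + b"\x93\x01\x02\x03"
decoded = unpack_all(stream)
print(decoded)
```

No separator is needed between messages: the decoder finishes one value exactly where the next tag byte begins, which is why reading concatenated output yields a list.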
API Reference
read_msgpack(path_or_buf[, encoding, iterator]) | Load msgpack pandas object from the specified file path
to_msgpack(path_or_buf, *args, **kwargs) | msgpack (serialize) object to input file path
pandas_msgpack.read_msgpack(path_or_buf, encoding='utf-8', iterator=False, **kwargs)
Load msgpack pandas object from the specified file path
Parameters:
  path_or_buf : string File path, BytesIO like or string
  encoding : Encoding for decoding msgpack str type
  iterator : boolean, if True, return an iterator to the unpacker (default is False)
Returns:
  obj : type of object stored in file
pandas_msgpack.to_msgpack(path_or_buf, *args, **kwargs)
msgpack (serialize) object to input file path
Parameters:
  path_or_buf : string File path, buffer-like, or None; if None, return generated string
  args : an object or objects to serialize
  encoding : encoding for unicode objects
  append : boolean whether to append to an existing msgpack (default is False)
  compress : type of compressor (zlib or blosc), default to None (no compression)
Changelog
0.1.4 / 2017-03-30
Initial release of transferred code from pandas
Includes patches since the 0.19.2 release on pandas with the following:
- Bug in read_msgpack() in which Series categoricals were being improperly processed, see pandas-GH#14901
- Bug in read_msgpack() which did not allow loading of a dataframe with an index of type CategoricalIndex, see pandas-GH#15487
- Bug in read_msgpack() when deserializing a CategoricalIndex, see pandas-GH#15487