Documentation for nbdime – tools for diffing and merging of Jupyter Notebooks¶
NB! The nbdime project and this documentation are in a very early stage of development and are not usable for any kind of production work yet.
Overview of nbdime project¶
The nbdime project aims to provide tools for diffing and merging of Jupyter notebooks.
TODO: Write a better introduction and overview.
Installing and testing nbdime¶
Dependencies¶
- Python version 2.7.1, 3.3, 3.4, 3.5
- six
- nbformat
Note that the requirement is 2.7.1, not 2.7.0, because 2.7.1 fixes a bug in difflib in an interface-breaking way.
Dependencies for running tests:
- pytest
- pytest-cov
Install¶
Use pip to install. See the pip documentation for options. Some examples:
Install requirements for the current user only:
pip install --user --upgrade -r requirements.txt
Install nbdime for the current user only:
pip install --user --upgrade .
Make a local developer install for the current user only:
pip install --user --upgrade -e .
Testing¶
See latest build, test and coverage status at:
To run the tests locally, run
py.test
from the project root. If you have both Python 2 and Python 3 installed, you may need to run
python3 -m pytest
to run the tests with Python 3. See the pytest documentation for more options.
If you have notebooks with interesting merge challenges, please consider contributing them to nbdime as test cases!
Commandline interface¶
Nbdime provides three CLI commands. See
nbdiff --help
nbpatch --help
nbmerge --help
for usage details.
Description of the diff representation in nbdime¶
Note: The diff format herein is still considered experimental until development stabilizes. If you have objections or opinions on the format, please raise them ASAP while the project is in its early stages.
In nbdime, the objects to diff are JSON-compatible nested structures of dicts (with string keys) and lists of values with heterogeneous types (strings, ints, floats). The difference between two such objects is itself represented as a JSON-compatible object, in the format described below.
Diff format basics¶
A diff object represents the difference B-A between two objects A and B as a list of operations (ops) to apply to A to obtain B. Each operation is represented as a dict with at least two items:
{ "op": <opname>, "key": <key> }
The objects A and B are either mappings (dicts) or sequences (lists or strings), and a different set of ops is legal for mappings and for sequences. Depending on the op, the operation dict usually contains an additional argument, documented below.
Diff format for mappings¶
For mappings, the key is always a string. Valid ops are:
- { "op": "remove", "key": <string> } - delete existing value at key
- { "op": "add", "key": <string>, "value": <value> } - insert new value at key not previously existing
- { "op": "replace", "key": <string>, "value": <value> } - replace existing value at key with new value
- { "op": "patch", "key": <string>, "diff": <diffobject> } - patch existing value at key with another diffobject
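To make the four mapping ops concrete, here is a minimal sketch of how such a diff could be applied to a dict. The function name `patch_mapping` is hypothetical and is not part of nbdime's API; the sketch also assumes that every value targeted by a "patch" op is itself a dict.

```python
import copy


def patch_mapping(obj, diff):
    """Apply a list of mapping ops to the dict `obj` and return a new dict.

    Illustrative only: nested "patch" ops are assumed to target dicts.
    """
    result = copy.deepcopy(obj)
    for entry in diff:
        op, key = entry["op"], entry["key"]
        if op == "remove":
            # Delete the existing value at key (KeyError if it is missing).
            del result[key]
        elif op == "add":
            # Insert a value at a key that must not exist yet.
            if key in result:
                raise KeyError("add: key already exists: %r" % (key,))
            result[key] = entry["value"]
        elif op == "replace":
            # Overwrite a value at a key that must already exist.
            if key not in result:
                raise KeyError("replace: key does not exist: %r" % (key,))
            result[key] = entry["value"]
        elif op == "patch":
            # Recurse into the nested value with the nested diff object.
            result[key] = patch_mapping(result[key], entry["diff"])
        else:
            raise ValueError("unknown mapping op: %r" % (op,))
    return result


a = {"x": 1, "y": 2, "meta": {"name": "old"}}
diff = [
    {"op": "remove", "key": "x"},
    {"op": "add", "key": "z", "value": 3},
    {"op": "replace", "key": "y", "value": 20},
    {"op": "patch", "key": "meta",
     "diff": [{"op": "replace", "key": "name", "value": "new"}]},
]
b = patch_mapping(a, diff)
print(b)
```

Working on a deep copy keeps A untouched, which makes it easy to compare the patched result against an independently known B.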
Diff format for sequences (list and string)¶
For sequences the key is always an integer index. This index is relative to object A of length N. Valid ops are:
- { "op": "removerange", "key": <int>, "length": <n> } - delete the values A[key:key+length]
- { "op": "addrange", "key": <int>, "valuelist": <values> } - insert new items from valuelist before A[key], at end if key=len(A)
- { "op": "patch", "key": <int>, "diff": <diffobject> } - patch existing value at key with another diffobject
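Because every key refers to a position in the original object A, a sequence diff can be applied in a single left-to-right pass over A. The following is a rough sketch under stated assumptions: `patch_list` is a hypothetical name, not an nbdime function; it handles lists only, supports just "addrange" and "removerange", and assumes the ops are well formed (no overlapping removals).

```python
def patch_list(seq, diff):
    """Apply "addrange"/"removerange" ops to the list `seq`, returning a new list.

    All keys are indices into the ORIGINAL list, so a cursor (`taken`)
    tracks how much of the original has been consumed so far.
    """
    result = []
    taken = 0
    # sorted() is stable, so ops sharing a key keep their given order.
    for entry in sorted(diff, key=lambda e: e["key"]):
        op, key = entry["op"], entry["key"]
        if key > taken:
            # Copy the untouched items that precede this op.
            result.extend(seq[taken:key])
            taken = key
        if op == "addrange":
            # Insert the new items before A[key].
            result.extend(entry["valuelist"])
        elif op == "removerange":
            # Skip A[key:key+length] by moving the cursor past it.
            taken = max(taken, key + entry["length"])
        else:
            raise ValueError("unsupported sequence op in this sketch: %r" % (op,))
    # Copy whatever remains of the original after the last op.
    result.extend(seq[taken:])
    return result


a = ["a", "b", "c", "d"]
diff = [
    {"op": "addrange", "key": 1, "valuelist": ["x"]},
    {"op": "removerange", "key": 1, "length": 2},
    {"op": "addrange", "key": 4, "valuelist": ["e"]},
]
print(patch_list(a, diff))
```

The last op uses key=len(A), which appends at the end as described in the list above.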
Relation to JSONPatch¶
The diff representation described above has similarities with the JSONPatch standard but differs in a few ways. JSONPatch contains the operations “move”, “copy”, and “test”, which nbdime does not use, and nbdime contains the operations “addrange”, “removerange”, and “patch”, which are not in JSONPatch. Instead of providing a recursive “patch” op, JSONPatch gives each operation a deep, JSON-pointer-based “path” item in place of the “key” item nbdime uses. This way JSONPatch can represent the diff object as a single list instead of the ‘tree’ of lists that nbdime uses. To convert an nbdime diff object to the JSONPatch format, use the function
from nbdime.diff_format import to_json_patch
jp = to_json_patch(diff_obj)
Note that this function is currently a draft and not covered by tests.
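To illustrate the difference between the nested "key" representation and JSONPatch's flat "path" representation, the sketch below flattens a diff that uses only the mapping ops. It is not the implementation of to_json_patch; the names `escape_pointer_token` and `flatten_mapping_diff` are hypothetical, and sequence ops are deliberately left out because translating "addrange" and "removerange" requires index bookkeeping that this sketch does not attempt.

```python
def escape_pointer_token(key):
    """Escape a key for use as a JSON Pointer token ("~" -> "~0", "/" -> "~1")."""
    return key.replace("~", "~0").replace("/", "~1")


def flatten_mapping_diff(diff, prefix=""):
    """Turn a nested diff of mapping ops into a flat list of path-based ops."""
    flat = []
    for entry in diff:
        op = entry["op"]
        path = prefix + "/" + escape_pointer_token(entry["key"])
        if op == "patch":
            # A recursive "patch" op becomes a longer path prefix.
            flat.extend(flatten_mapping_diff(entry["diff"], path))
        elif op == "remove":
            flat.append({"op": "remove", "path": path})
        elif op in ("add", "replace"):
            flat.append({"op": op, "path": path, "value": entry["value"]})
        else:
            raise ValueError("unsupported op in this sketch: %r" % (op,))
    return flat


nested = [
    {"op": "patch", "key": "metadata", "diff": [
        {"op": "replace", "key": "name", "value": "new"},
        {"op": "remove", "key": "a/b"},
    ]},
    {"op": "add", "key": "z", "value": 3},
]
for flat_op in flatten_mapping_diff(nested):
    print(flat_op)
```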
Examples¶
For examples of concrete diffs, see e.g. the test suite in test_patch.py.
Representing merge results and conflicts¶
Nbdime implements a three-way merge of Jupyter notebooks and of a large subset of generic JSON objects. Given a shared origin object base and two modified objects local and remote, the result of a merge operation is a fully or partially merged object, plus two diff objects: one between the partially merged object and local, and one between the partially merged object and remote. These two diff objects represent the merge conflicts that could not be automatically resolved.
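As an illustration of the idea only, the following sketch performs a three-way merge of flat dicts and reports unresolved conflicts as two diffs in the mapping format described earlier. It is not nbdime's merge algorithm, and the return shape is an assumption, since the output format is still to be defined; `merge_flat` and `_op_for` are hypothetical names.

```python
_MISSING = object()  # sentinel for "key not present"


def _op_for(key, old, new):
    """Return the mapping op that turns `old` into `new` at `key`."""
    if old is _MISSING:
        return {"op": "add", "key": key, "value": new}
    if new is _MISSING:
        return {"op": "remove", "key": key}
    return {"op": "replace", "key": key, "value": new}


def merge_flat(base, local, remote):
    """Three-way merge of flat dicts.

    Returns (merged, local_conflicts, remote_conflicts). On a conflict the
    base value is kept in `merged`, and each side's change is recorded as a
    diff against that partially merged result.
    """
    merged, local_conflicts, remote_conflicts = {}, [], []
    for key in sorted(set(base) | set(local) | set(remote)):
        b = base.get(key, _MISSING)
        l = local.get(key, _MISSING)
        r = remote.get(key, _MISSING)
        if l == r:
            chosen = l  # both sides agree (possibly both deleted the key)
        elif l == b:
            chosen = r  # only remote changed
        elif r == b:
            chosen = l  # only local changed
        else:
            chosen = b  # both changed differently: unresolved conflict
            local_conflicts.append(_op_for(key, b, l))
            remote_conflicts.append(_op_for(key, b, r))
        if chosen is not _MISSING:
            merged[key] = chosen
    return merged, local_conflicts, remote_conflicts


base = {"a": 1, "b": 2, "c": 3}
local = {"a": 1, "b": 20, "c": 30}
remote = {"a": 10, "b": 2, "c": 31, "d": 4}
merged, local_conflicts, remote_conflicts = merge_flat(base, local, remote)
print(merged)
print(local_conflicts)
print(remote_conflicts)
```

Here only key "c" conflicts, because both sides changed it to different values; every other change is taken from whichever side made it.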
TODO: Define output formats for the merge operation.
Notebook specific issues¶
TODO: Document issues covered and plans here.
Use cases and future development plans¶
Fundamentally, we envision use cases mainly in two categories: a merge command for version control integration, and a diff command for inspecting changes and for automated regression testing. At the core of it all are the diff algorithms, which must handle not only text in source cells but also a number of data formats based on mime types in output cells.
Basic diffing use cases¶
We assume that basic correct diffing is fairly straightforward to implement, but there are still some issues to discuss.
Other tasks (will make issues of these):
- Pretty-printing of diff for commandline output.
- Plugin framework for mime type specific diffing.
- Diffing of common output types (png, svg, etc.)
- Improve the fundamental sequence diff algorithm. The current algorithm is based on a brute-force O(N^2) longest common subsequence (LCS) algorithm. This will be rewritten in terms of a faster algorithm such as Myers' O(ND) LCS-based diff algorithm, optionally using Python's difflib for some use cases where it is appropriate.
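As a rough illustration of the difflib option mentioned in the last item, the sketch below converts the opcodes from the standard library's `difflib.SequenceMatcher` into "addrange" and "removerange" ops. This is not nbdime's current algorithm; `diff_sequence` is a hypothetical name, the items are assumed to be hashable, and `SequenceMatcher` uses its own matching heuristic rather than a strict LCS, so the result is a valid diff but not necessarily a minimal one.

```python
import difflib


def diff_sequence(a, b):
    """Compute a list of sequence ops that turn list `a` into list `b`.

    Keys are indices into `a`, as required by the sequence diff format.
    """
    ops = []
    # autojunk=False disables the "popular element" heuristic so that
    # frequently repeated items are still matched.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            # New items from b go in before a[i1].
            ops.append({"op": "addrange", "key": i1, "valuelist": list(b[j1:j2])})
        if tag in ("delete", "replace"):
            # The old items a[i1:i2] are dropped.
            ops.append({"op": "removerange", "key": i1, "length": i2 - i1})
        # "equal" blocks need no op.
    return ops


a = ["a", "b", "c", "d"]
b = ["a", "x", "d", "e"]
for op in diff_sequence(a, b):
    print(op)
```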
Version control use cases¶
Most commonly, cell source is the primary content, and output can presumably be regenerated. Indeed, it is not possible to guarantee that merged sources and merged output are consistent or make any kind of sense.
Some tasks:
- Merge of output cell content is not planned.
- Is it important to track source lines moving between cells?