Welcome to Data Hub API’s documentation!

Contents:

Data Hub API Overview

API for the UKTI Data Hub.

Official docs on Read the Docs

Dependencies

Installation

Clone the repository:

git clone git@github.com:UKTradeInvestment/data-hub-api.git

Next, create the environment and start it up:

cd data-hub-api
virtualenv env --python=python3.5

source env/bin/activate

Update pip to the latest version:

pip install -U pip

Install python dependencies:

pip install -r requirements/local.txt

Create a PostgreSQL database called data-hub-api.

On OS X, update the PATH and DYLD_LIBRARY_PATH environment variables if necessary:

export PATH="/Applications/Postgres.app/Contents/MacOS/bin/:$PATH"
export DYLD_LIBRARY_PATH="/Applications/Postgres.app/Contents/MacOS/lib/:$DYLD_LIBRARY_PATH"

Create a local.py settings file from the example file and set the CDMS settings/credentials:

cp data-hub-api/settings/local.example.py data-hub-api/settings/local.py

Sync and migrate the database:

./manage.py migrate

Start the server:

./manage.py runserver 8000

CDMS Sync

The problem, the options and our approach

The problem

We are migrating away from Microsoft Dynamics 2011 (CDMS) and decided to build a new CRM system (Data Hub) using a gradual incremental approach.

During a period of several months, the following constraints apply:

  • data between CDMS and Data Hub needs to be kept in sync
  • Data Hub needs to allow re-modelling by adding/removing types/properties
  • some users would continue to use CDMS whilst we transition from one system to the other

The options

We considered different approaches including:

  • use CDMS as data store and access it directly. This has many disadvantages including hosting CDMS, not being able to easily change the schemas, architecture complexity etc.
  • use two data stores with some sort of low-level synchronisation (via database or processes). This also has many disadvantages, including integrating with old technologies (Dynamics 2011), two separate layers (code and sync logic) that depend tightly on each other and are hard to manage, synchronisation conflicts etc.
  • use two data stores with code-managed synchronisation. This is the chosen architecture. It has some disadvantages as well, which are explained later.

The chosen approach

Two data stores, with reads and writes to CDMS happening as usual and synchronisation triggered by I/O actions in Data Hub.

Writes to Data Hub will:

  • get the object from CDMS (if it exists)
  • apply the changes and write to CDMS
  • apply the changes in Data Hub

Reads from Data Hub will:

  • get the object from the Data Hub data store
  • get the related object from CDMS
  • check if CDMS was updated after the last synchronisation
  • if so, update the Data Hub object
  • return the local results

Read and write operations are performed as a single transaction so that changes are rolled back in case of exceptions with CDMS.
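The write path and its rollback behaviour can be sketched in plain Python. This is only an illustration under stated assumptions: FakeCDMS, CDMSError and write_object are hypothetical names, the data stores are dictionaries, and the "transaction" is a simple snapshot rather than a real database transaction.

```python
import copy


class CDMSError(Exception):
    """Hypothetical error raised when a CDMS call fails."""


class FakeCDMS:
    """In-memory stand-in for the CDMS API (illustration only)."""

    def __init__(self, fail=False):
        self.objects = {}
        self.fail = fail

    def get(self, pk):
        return self.objects.get(pk)

    def write(self, pk, data):
        if self.fail:
            raise CDMSError("CDMS is unavailable")
        self.objects[pk] = dict(data)


def write_object(local_store, cdms, pk, changes):
    """Write to CDMS first, then locally; restore the local store on failure."""
    snapshot = copy.deepcopy(local_store)  # stands in for a DB transaction
    try:
        current = cdms.get(pk) or {}       # get the object from CDMS, if any
        updated = {**current, **changes}   # apply the changes ...
        cdms.write(pk, updated)            # ... and write them to CDMS
        local_store[pk] = dict(updated)    # apply the changes in Data Hub
    except CDMSError:
        local_store.clear()
        local_store.update(snapshot)       # roll back any local changes
        raise
```

If the CDMS write raises, the local store ends up exactly as it was before the call, which is the property the single transaction is meant to guarantee.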

The same object on both systems is considered in sync if the modified field value is the same. If the modified value of the CDMS version is more recent, the Data Hub object has to be updated from the CDMS one. If the modified value of the Data Hub version is more recent, an exception is raised as this should never happen: writes on Data Hub always generate writes in CDMS, but the reverse is not true.
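The three possible outcomes of comparing the modified values can be written down as a small function. This is a minimal sketch that uses dictionaries in place of model instances; refresh_from_cdms and ObjectsOutOfSyncError are hypothetical names, not part of the project.

```python
from datetime import datetime


class ObjectsOutOfSyncError(Exception):
    """Hypothetical error: the local object is newer than the CDMS one."""


def refresh_from_cdms(local_obj, cdms_obj):
    """Return the object to keep locally after comparing `modified` values."""
    if local_obj["modified"] == cdms_obj["modified"]:
        return local_obj                  # already in sync
    if cdms_obj["modified"] > local_obj["modified"]:
        return dict(cdms_obj)             # CDMS is newer: refresh the local copy
    # A newer local object should never happen, so fail loudly.
    raise ObjectsOutOfSyncError("local object is newer than the CDMS one")


older = datetime(2016, 1, 1, 12, 0)
newer = datetime(2016, 1, 2, 12, 0)
local = {"name": "Acme", "modified": older}
remote = {"name": "Acme Ltd", "modified": newer}
refreshed = refresh_from_cdms(local, remote)
```

Here refreshed carries the CDMS values because the CDMS modified value is the more recent one.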

The possibility of conflicts is low as:

  • objects on the two systems are kept in sync via the modified field updated after each CDMS get
  • concurrent operations to a single object are low or non-existent in volume

In case two updates happen at approximately the same time, the last one wins. This should not be a problem as the system keeps a history of the changes.

Limitations

There are some limitations in using this approach:

  • Number of requests. This has not been measured yet but could (and should) be partially addressed with a caching strategy
  • The synchronisation happens using one common CDMS user
  • Some Django ORM APIs cannot be easily implemented, e.g. Model.objects.count() or Model.objects.filter(field1__field2='something'). This is mainly because of the old CDMS technologies
  • It might not be easy to change the Django schema in many cases as the sync layer prefers a one-to-one mapping.

Integration instructions

How it works

A custom Django manager/queryset intercepts reads and writes and takes care of all the CDMS operations. This means that developers can ignore this extra complexity and use the Django ORM API as usual.

That being said, only a subset of the Django ORM API has been implemented, and some calls cannot be supported at all. Check Django ORM integration for the full list of supported ORM calls.

Project setup

The cdms_api app contains the CDMS API library whilst the migrator app defines all the code needed for the synchronisation.

Note

It’s really important that you keep all the logic related to the CDMS sync in one place and to a minimum so that it’s easy to get rid of it when it’s time to shut down CDMS and delete the sync layer altogether.

For this reason, we decided to keep this logic in the migrator app and in one single file per django app (conventionally called cdms_migrator.py).

Your app and your Django model

  1. Set up Django app/model

Create a new Django app or simply a new Django model as needed.

  2. CDMSMigrator

In a module called <your-app>/cdms_migrator.py, subclass migrator.cdms_migrator.BaseCDMSMigrator and define the mapping fields and the CDMS service.

Add the CDMSMigrator to the Django model as per step 3.

  3. Configure your model

Change your model so that it looks like the one below:

Note

For foreign key fields, you should use core.fields.UKTIForeignKey instead of the Django one.

from reversion import revisions as reversion

from django.db import models

from core.models import CRMBaseModel
from core.managers import CRMManager

from .cdms_migrator import MyModelMigrator


@reversion.register()
class MyModel(CRMBaseModel):
    ...  # your model fields

    objects = CRMManager()
    cdms_migrator = MyModelMigrator()

  4. Create a migration for your model as usual

./manage.py makemigrations
./manage.py migrate

CDMSMigrator

The mapping between your model and the CDMS one is defined in your model’s CDMSMigrator class which should be in <your-app>/cdms_migrator.py.

Extend the migrator.cdms_migrator.BaseCDMSMigrator class and define the fields and service attributes.

For example:

from cdms_api import fields as cdms_fields

from migrator.cdms_migrator import BaseCDMSMigrator


class OrganisationMigrator(BaseCDMSMigrator):
    fields = {
        'name': cdms_fields.StringField('Name'),
        'uk_organisation': cdms_fields.BooleanField('optevia_ukorganisation'),
        ...
    }
    service = 'Account'  # this is the Dynamics resource name
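To make the purpose of the fields mapping concrete, here is a self-contained sketch of how such a mapping could turn local values into a CDMS payload. The StringField and BooleanField classes below are simplified stand-ins written for this example, and to_cdms and to_cdms_payload are hypothetical method names; the real cdms_api fields and BaseCDMSMigrator may work differently.

```python
class StringField:
    """Simplified stand-in for cdms_fields.StringField (assumption)."""

    def __init__(self, cdms_name):
        self.cdms_name = cdms_name

    def to_cdms(self, value):
        return None if value is None else str(value)


class BooleanField:
    """Simplified stand-in for cdms_fields.BooleanField (assumption)."""

    def __init__(self, cdms_name):
        self.cdms_name = cdms_name

    def to_cdms(self, value):
        return bool(value)


class OrganisationMigrator:
    # local field name -> CDMS field definition
    fields = {
        'name': StringField('Name'),
        'uk_organisation': BooleanField('optevia_ukorganisation'),
    }
    service = 'Account'

    def to_cdms_payload(self, local_values):
        """Rename and convert local values using the mapping above."""
        return {
            field.cdms_name: field.to_cdms(local_values[name])
            for name, field in self.fields.items()
        }


payload = OrganisationMigrator().to_cdms_payload(
    {'name': 'Acme', 'uk_organisation': 1}
)
```

The resulting payload is keyed by the CDMS field names, so the rest of the code only ever deals with the local names.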

Django ORM integration

Operations that cause synchronisation

.filter(...) operations make a CDMS API call to get the CDMS objects matching the translated filter, refresh the local objects by updating or creating them, and then return the standard Django results.

.get(...) operations get the object both locally and from CDMS, compare the two, update the local one if needed, and then return the standard Django result.

.create(...) operations, or .save() on a new object, create the object locally and in CDMS. If CDMS raises an exception, the local changes are rolled back.

.save() operations on an existing object update the object locally and in CDMS. If CDMS raises an exception, the local changes are rolled back.

.delete() operations delete the object locally and in CDMS. If CDMS raises an exception, the local changes are rolled back.
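The "translated filtering" mentioned for .filter(...) can be illustrated with a small translator from Django-style lookups to a filter string. This sketch assumes an OData-style $filter syntax (eq, gt, ge, lt, le, substringof, startswith, endswith) and covers only a handful of lookups; translate, the field_map argument and the 'Turnover' field name are hypothetical and do not come from the project.

```python
# Django lookup -> OData-style filter template (assumed target syntax)
OPERATORS = {
    'exact': "{field} eq {value}",
    'gt': "{field} gt {value}",
    'gte': "{field} ge {value}",
    'lt': "{field} lt {value}",
    'lte': "{field} le {value}",
    'contains': "substringof({value}, {field})",
    'startswith': "startswith({field}, {value})",
    'endswith': "endswith({field}, {value})",
}


def quote(value):
    """Render a Python value as a filter literal."""
    if isinstance(value, bool):
        return 'true' if value else 'false'
    if isinstance(value, (int, float)):
        return str(value)
    return "'{}'".format(str(value).replace("'", "''"))


def translate(field_map, **lookups):
    """Translate Django-style keyword lookups into one filter string."""
    parts = []
    for key, value in sorted(lookups.items()):
        name, _, operator = key.partition('__')
        operator = operator or 'exact'
        if operator not in OPERATORS:
            raise NotImplementedError(
                '{} is not handled in this sketch'.format(operator)
            )
        parts.append(OPERATORS[operator].format(
            field=field_map[name], value=quote(value),
        ))
    return ' and '.join(parts)


field_map = {'name': 'Name', 'turnover': 'Turnover'}
expression = translate(field_map, name='Acme', turnover__gte=100)
```

Lookups without a counterpart in the target syntax raise NotImplementedError here, which mirrors the idea that some ORM calls simply cannot be passed on to CDMS.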


✔ Supported ✘ Not supported

Lookups
API Description
✔ Klass.objects.filter(field__exact=...)  
✔ Klass.objects.filter(field__iexact=...)  
✔ Klass.objects.filter(field__contains=...)  
✔ Klass.objects.filter(field__icontains=...)  
✘ Klass.objects.filter(field__in=...)  
✔ Klass.objects.filter(field__gt=...)  
✔ Klass.objects.filter(field__gte=...)  
✔ Klass.objects.filter(field__lt=...)  
✔ Klass.objects.filter(field__lte=...)  
✔ Klass.objects.filter(field__startswith=...)  
✔ Klass.objects.filter(field__istartswith=...)  
✔ Klass.objects.filter(field__endswith=...)  
✔ Klass.objects.filter(field__iendswith=...)  
✘ Klass.objects.filter(field__range=...)  
✔ Klass.objects.filter(field__year=...)  
✔ Klass.objects.filter(field__day=...)  
✘ Klass.objects.filter(field__week_day=...)  
✔ Klass.objects.filter(field__hour=...)  
✔ Klass.objects.filter(field__minute=...)  
✔ Klass.objects.filter(field__second=...)  
✘ Klass.objects.filter(field__isnull=...) Not yet implemented but we should really support it.
✘ Klass.objects.filter(field__search=...)  
✘ Klass.objects.filter(field__regex=...)  
✘ Klass.objects.filter(field__iregex=...)  
Filtering
API Description
✔ Klass.objects.all() It only syncs the top 50 objects from CDMS as it would be infeasible to sync all of them.
✔ Klass.objects.filter(field=...)  
✔ Klass.objects.filter(Q(field=...))  
✔ Klass.objects.filter(field1=..., field2=...)  
✔ Klass.objects.filter(Q(field1=...) & Q(field2=...))  
✔ Klass.objects.filter(Q(field1=...) | Q(field2=...))  
✔ Klass.objects.filter(field1=...).filter(field2=...)  
✔ Klass.objects.filter(Q(Q(field1=...) & Q(field2=...)) & Q(field3=...))  
✔ Klass.objects.exclude(field=...)  
✔ Klass.objects.exclude(field1=..., field2=...)  
✔ Klass.objects.exclude(field1=...).exclude(field2=...)  
✔ Klass.objects.exclude(Q(field1=...) | Q(field2=...))  
✔ Klass.objects.exclude(Q(field1=...) & Q(field2=...))  
✔ Klass.objects.filter(field1=...).exclude(field2=...)  
✔ Klass.objects.filter(Q(field1=...) | Q(field2=...)).exclude(Q(field3=...) & Q(field4=...))  
Order by
API Description
✔ Klass.objects.all().order_by('field')  
✔ Klass.objects.all().order_by('-field')  
✔ Klass.objects.all().order_by('field1', '-field2')  
✘ Klass.objects.all().order_by('?')  
Get
API Description
✔ Klass.objects.get(pk=...) Gets the local object and the CDMS one, compares the two and, if necessary, updates the local one before returning it
✔ Klass.objects.get(cdms_pk=...) Gets the object locally, or from CDMS if it doesn’t exist locally, and updates or creates the local one if necessary before returning it
✔ Klass.objects.get(field=...) Like .get(pk=...)
Create
API Description
✔ obj = Klass(field=...); obj.save()  
✔ Klass.objects.create(field=...)  
✘ Klass.objects.bulk_create(...)  
Update
API Description
✔ obj.save()  
✘ Klass.objects.filter(field=...).update(...)  
✘ Klass.objects.select_for_update(...)  
Delete
API Description
✔ obj.delete()  
✘ Klass.objects.filter(field=...).delete()  
Misc
API Description
✘ Klass.objects.annotate(...)  
✘ Klass.objects.reverse(...)  
✘ Klass.objects.distinct(...)  
✘ Klass.objects.values(...)  
✘ Klass.objects.values_list(...)  
✘ Klass.objects.dates(...)  
✘ Klass.objects.datetimes(...)  
✔ Klass.objects.none()  
✘ Klass.objects.select_related(...)  
✘ Klass.objects.prefetch_related(...)  
✘ Klass.objects.extra(...)  
✘ Klass.objects.defer(...)  
✘ Klass.objects.only(...)  
✘ Klass.objects.raw(...)  
✘ Klass.objects.get_or_create(...)  
✘ Klass.objects.update_or_create(...)  
✘ Klass.objects.count(...)  
✘ Klass.objects.in_bulk(...)  
✘ Klass.objects.latest(...)  
✘ Klass.objects.earliest(...)  
✘ Klass.objects.first(...)  
✘ Klass.objects.last(...)  
✘ Klass.objects.aggregate(...)  
✘ Klass.objects.exists(...)  

Operations that skip synchronisation

Most of the time, you can skip CDMS operations by using the skip_cdms() method on the manager or the skip_cdms param on the save/delete methods.

Note

Do not skip the CDMS operations when writing, as the objects would then become out of sync. If this is really required, we may need to rename the modified field to something like cdms_modified and add a separate modified field.
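The effect of skip_cdms() on reads can be pictured with a toy manager. This is a plain-Python sketch, not the project's CRMManager: the Manager class, its constructor arguments and the cdms_log list are assumptions made for the example.

```python
class Manager:
    """Toy manager that records CDMS calls unless told to skip them."""

    def __init__(self, local_store, cdms_log, skip=False):
        self.local_store = local_store
        self.cdms_log = cdms_log      # records the CDMS calls that would be made
        self._skip = skip

    def skip_cdms(self):
        """Return a copy of this manager that never talks to CDMS."""
        return Manager(self.local_store, self.cdms_log, skip=True)

    def get(self, pk):
        if not self._skip:
            self.cdms_log.append(('get', pk))   # sync with CDMS would happen here
        return self.local_store[pk]


store = {1: {'name': 'Acme'}}
log = []
objects = Manager(store, log)

objects.skip_cdms().get(1)   # local read only, nothing logged
calls_after_skip = list(log)
objects.get(1)               # normal read, one CDMS call logged
```

Returning a new manager from skip_cdms() keeps the default manager untouched, so a later plain objects.get(...) still synchronises as usual.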

✔ Supported ✘ Not supported

Filtering
API Description
✔ Klass.objects.skip_cdms().all()  
✔ Klass.objects.skip_cdms().filter(...)  
✔ Klass.objects.skip_cdms().exclude(...)  
✔ Klass.objects.skip_cdms().all().order_by(...)  
Get
API Description
✔ Klass.objects.skip_cdms().get()  
Create
API Description
✔ obj = Klass(field=...); obj.save(skip_cdms=True)  
✔ Klass.objects.skip_cdms().create(field=...)  
✔ Klass.objects.skip_cdms().bulk_create(...)  
Update
API Description
✔ obj.save(skip_cdms=True)  
✔ Klass.objects.skip_cdms().filter(field=...).update(...)  
✔ Klass.objects.skip_cdms().select_for_update(...)  
Delete
API Description
✔ obj.delete(skip_cdms=True)  
✔ Klass.objects.skip_cdms().filter(field=...).delete()  
Misc
API Description
✔ Klass.objects.skip_cdms().annotate(...)  
✔ Klass.objects.skip_cdms().reverse(...)  
✔ Klass.objects.skip_cdms().distinct(...)  
✔ Klass.objects.skip_cdms().values(...)  
✔ Klass.objects.skip_cdms().values_list(...)  
✔ Klass.objects.skip_cdms().dates(...)  
✔ Klass.objects.skip_cdms().datetimes(...)  
✔ Klass.objects.skip_cdms().none()  
✔ Klass.objects.skip_cdms().select_related(...)  
✘ Klass.objects.skip_cdms().prefetch_related(...)  
✔ Klass.objects.skip_cdms().extra(...)  
✔ Klass.objects.skip_cdms().defer(...)  
✔ Klass.objects.skip_cdms().only(...)  
✔ Klass.objects.skip_cdms().raw(...)  
✔ Klass.objects.skip_cdms().get_or_create(...)  
✔ Klass.objects.skip_cdms().update_or_create(...)  
✔ Klass.objects.skip_cdms().count(...)  
✔ Klass.objects.skip_cdms().in_bulk(...)  
✔ Klass.objects.skip_cdms().latest(...)  
✔ Klass.objects.skip_cdms().earliest(...)  
✔ Klass.objects.skip_cdms().first(...)  
✔ Klass.objects.skip_cdms().last(...)  
✔ Klass.objects.skip_cdms().aggregate(...)  
✔ Klass.objects.skip_cdms().exists(...)  

Revisions

We use django-reversion for creating revisions and versions.

How django-reversion works

django-reversion uses revisions and versions.

Revisions are blocks of code where some changes happen. One or more objects could potentially change in the same block.

Versions are changes to an object in a given revision. Versions always have a foreign key to the related revision.

Revisions can have the following metadata:

  • user: who made the changes
  • comment: optional text

Metadata has to be set manually for obvious reasons.

django-reversion is usually integrated in one of two ways:

  • via the admin integration so that every time a user uses the admin, changes are saved automatically
  • via an explicit context manager with the possibility to set metadata programmatically

How django-reversion is used

As we wanted to create revisions/versions automatically and not lose any changes, we implemented django-reversion at a lower level.

In our system we have 2 types of changes:

  • CDMS refresh changes: where we refresh a local object (update or create) from CDMS. This happens automatically by creating a version of the object with the comment CDMS refresh.
  • local changes: where we make a change to the objects of our system. This happens every time the .save() method is called and it’s automatic.
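The relationship between revisions and versions, and the two kinds of change listed above, can be modelled with two small classes. This is a conceptual sketch only: the Revision and Version classes and record_change are written for this example and are not the django-reversion models or API.

```python
class Revision:
    """A group of changes, with optional metadata."""

    def __init__(self, comment='', user=None):
        self.comment = comment
        self.user = user          # left unset here, as in the current system


class Version:
    """The state of one object within a given revision."""

    def __init__(self, revision, object_pk, data):
        self.revision = revision  # plays the role of the foreign key
        self.object_pk = object_pk
        self.data = dict(data)


def record_change(history, object_pk, data, comment=''):
    """Store a new version of the object inside its own revision."""
    version = Version(Revision(comment=comment), object_pk, data)
    history.append(version)
    return version


history = []
# A CDMS refresh change: the local object is refreshed from CDMS.
refresh = record_change(history, 1, {'name': 'Acme'}, comment='CDMS refresh')
# A local change: what a call to .save() would record.
local_change = record_change(history, 1, {'name': 'Acme Ltd'})
```

Keeping every version in the history is what allows a lost update to be recovered when two writes happen at about the same time.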

Note

As we can’t access the user automatically, we are currently not setting the related metadata on the revision. We need to look into this; it might just be a matter of using the context manager in API views.

Shutting down CDMS

If you are reading this it means that it’s probably time to shut down CDMS and get rid of all that crazy sync shit. Congratulations and well done!

Hopefully, the past developers made your life easier and removing all dependencies means that you only need to:

  • change core.models.CRMBaseModel so that it extends core.lib_models.TimeStampedModel instead of migrator.models.CDMSModel
  • change core.managers.CRMManager so that it extends the Django default manager instead of migrator.managers.CDMSManager
  • delete the migrator and the cdms_api apps
  • delete the cdms_migrator.py file in every Django app
  • remove all unused values from the settings
  • run makemigrations and migrate