Combine 🚜¶
Overview¶
Combine is an application to facilitate the harvesting, transformation, analysis, and publishing of metadata records by Service Hubs for inclusion in the Digital Public Library of America (DPLA).
The name “Combine”, pronounced /kämˌbīn/ with a long i, is a nod to the combine harvester used in farming for “combining three separate harvesting operations - reaping, threshing, and winnowing - into a single process.” Instead of grains, we have metadata records! These metadata records may come in a variety of metadata formats, in various states of transformation, and may or may not be valid in the context of a particular data model. Like the combine equipment used for farming, this application is designed to provide a single point of interaction for multiple steps along the way of harvesting, transforming, and analyzing metadata in preparation for inclusion in DPLA.
Installation¶
Combine has a fair amount of server components, dependencies, and configurations that must be in place to work, as it leverages Apache Spark, among other applications, for processing on the backend.
To this end, a separate GitHub repository, Combine-playbook, has been created to assist with provisioning a server with everything necessary in place to run Combine. This repository provides routes for server provisioning via Vagrant and/or Ansible. Please visit the Combine-playbook repository for more information about installation.
Table of Contents¶
If you just want to kick the tires, the QuickStart guide provides a walkthrough of harvesting, transforming, and publishing some records, laying the groundwork for more advanced analysis.
QuickStart¶
Notes and Update¶
- This QuickStart guide provides a high level walkthrough of harvesting records from static files, transforming those records with XSLT, and publishing via Combine’s built-in OAI server.
- As of 9/20/2018, with v0.3 on the horizon for release, this QuickStart guide is becoming very outdated. The goal is to update it soon, but in the interim, proceed at your own peril!
Overview¶
Sometimes you can’t beat kicking the tires to see how an application works. This “QuickStart” guide will walk through the harvesting, transforming, and publishing of metadata records in Combine, with some detours for explanation.
Demo data from unit tests will be reused here to avoid the need to provide actual OAI-PMH endpoints, Transformation or Validation scenarios, or other configurations unique to a DPLA Service Hub.
This guide will walk through the following areas of Combine, and it’s recommended to do so in order:
- sshing into server
- python environment
- starting / stopping Combine
- Combine data model
- Organizations, RecordGroups, Jobs, Records
- configuration
- setting up OAI-PMH endpoints
- creating Transformation Scenarios
- creating Validation Scenarios
- harvesting Records
- transforming Records
- looking at Jobs and Records
- duplicating / merging Jobs
- publishing Records
- analysis jobs
- troubleshooting
For simplicity’s sake, we will assume Combine is installed on a server with the domain name of combine, though likely running at the IP 192.168.45.10, which the Ansible/Vagrant install from Combine-Playbook defaults to. On most systems you can point that IP to a domain name like combine by modifying your /etc/hosts file on your local machine.
Note: combine and 192.168.45.10 might be used interchangeably throughout.
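For example, a hosts entry on your local machine might look like the following (illustrative only; adjust to your own environment):
# /etc/hosts on your local workstation
192.168.45.10    combine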
SSHing into server¶
The most reliable way is to ssh in as the combine user (assuming the server is at 192.168.45.10); the password is also combine:
# username/password is combine/combine
ssh combine@192.168.45.10
You can also use Vagrant to ssh in, from the Vagrant directory on the host machine:
vagrant ssh
If using Vagrant to ssh in, you’ll want to switch users and become combine, as most things are geared for that user.
Combine python environment¶
Combine runs in a Miniconda python environment, which can be activated from any filepath location by typing:
source activate combine
Note: Most commands in this QuickStart guide require you to be in this environment.
Starting / Stopping Combine¶
Gunicorn¶
For normal operations, Combine is run using Supervisor, a python-based application for running system processes. Specifically, it’s running under the Python WSGI server Gunicorn, under the supervisor program named gunicorn.
Start Combine:
sudo supervisorctl start gunicorn
Stop Combine:
sudo supervisorctl stop gunicorn
You can confirm that Combine is running by visiting http://192.168.45.10/combine, where you should be prompted to login with a username and password. For default/testing installations, you can use combine / combine for these credentials.
Django runserver¶
You can also run Combine via Django’s built-in server.
Convenience script, from /opt/combine:
./runserver.sh
Or, you can run the Django command explicitly from /opt/combine:
./manage.py runserver --noreload 0.0.0.0:8000
You can confirm that Combine is running by visiting http://192.168.45.10:8000/combine (note the 8000 port number).
Livy Sessions¶
To run any Jobs, Combine relies on an active (idle) Apache Livy session. Livy is what makes running Spark jobs possible via the familiar request/response cycle of a Django application.
Currently, users are responsible for determining if the Livy session is ready, though there are plans to have this automatically handled.
To check and/or start a new Livy session, navigate to http://192.168.45.10/combine/system. The important column is status, which should read idle. If not, click Stop or Remove under the actions column, and once stopped, click the start new session link near the top. It takes anywhere from 10-20 seconds for the session to become idle.
You can check the status of the Livy session at a glance from the Combine navigation, where Livy/Spark next to System should have a green background if active.
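If you prefer the command line, Livy itself also exposes a REST API for listing sessions; assuming Livy is running on its default port 8998 on the same server (an assumption about this particular install), a quick sketch for checking session state might look like:
import requests

# Query the Livy REST API for existing sessions (host and default port 8998 assumed)
resp = requests.get("http://192.168.45.10:8998/sessions")
resp.raise_for_status()

# Print the id and state (e.g. "starting", "idle", "busy") for each session
for session in resp.json().get("sessions", []):
    print(session["id"], session["state"])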
Combine Data Model¶
Organization¶
The highest level of organization in Combine is an Organization. Organizations are intended to group and organize records at the level of an institution or organization, hence the name.
You can create a new Organization from the Organizations page, or by clicking “Organizations” from the navigation links at the top of any page.
For this walkthrough, we can create one with the name “Amazing University”. Only the name field is required; others are optional.
RecordGroup¶
Within Organizations are RecordGroups. A RecordGroup is a “bucket” for a group of intellectually similar records. It is worth noting now that a single RecordGroup can contain multiple Jobs, whether they are failed or incomplete attempts, or successive runs over time. Suffice it to say for now that RecordGroups may contain lots of Jobs, which we will create here in a minute through harvests, transforms, etc.
For our example Organization, “Amazing University”, a reasonable RecordGroup might be this fictional University’s Fedora Commons based digital repository. To create a new RecordGroup, from the Organizations page, click on the Organization “Amazing University” from the table. From the following Organization page for “Amazing University” you can create a new RecordGroup. Let’s call it “Fedora Repository”; again, no other fields are required beyond name.
Finally, click into the newly created RecordGroup “Fedora Repository” to see the RecordGroup’s page, where we can begin to run Jobs.
Jobs¶
Central to Combine’s workflow philosophy are the ideas of Jobs. Jobs include any of the following:
- Harvest (OAI-PMH, static XML, and others to come)
- Transform
- Merge/Duplicate
- Publish
- Analysis
Within the context of a RecordGroup, one can think of Jobs as “stages” of a group of records, one Job serving as the input for the next Job run on those records, e.g.:
OAI-PMH Harvest Job ---> XSLT Transform Job --> Publish Job
Record¶
Lastly, the most granular major entity in Combine is an individual Record. Records exist within a Job. When a Job is deleted, so are the Records (the same can be said for any of these hierarchies moving up). Records will be created in the course of running Jobs.
Briefly, Records are stored in MySQL, and are indexed in ElasticSearch. In MySQL, you will find the raw Record XML metadata, and other information related to the Record throughout various stages in Combine. In ElasticSearch, you find a flattened, indexed form of the Record’s metadata, but not much more. The representation of a Record in ElasticSearch is almost entirely for analysis and search; the transactional record of the Record’s movement through various stages and Jobs in Combine is the Record as stored in MySQL.
It is worth noting, though not dwelling on here, that groups of Records are also stored as Avro files on disk.
Configuration and Scenarios¶
Combine relies on users configuring “scenarios” that will be used for things like transformations, validations, etc. These can be viewed, modified, and tested in the Configuration page. This page includes the following main sections:
- Field Mapper Configurations
- OAI-PMH endpoints
- Transformation Scenarios
- Validation Scenarios
- Record Identifier Transformation Scenarios
- DPLA Bulk Data Downloader
For the sake of this QuickStart demo, we can bootstrap our instance of Combine with some demo configurations, creating the following:
- Transformation Scenario
- “MODS to Service Hub profile” (XSLT transformation)
- Validation Scenarios
- “DPLA minimum” (schematron validation)
- “Date checker” (python validation)
To bootstrap these demo configurations for the purpose of this walkthrough, run the following command from /opt/combine:
./manage.py quickstartbootstrap
You can confirm these demo configurations were created by navigating to the configuration screen at http://192.168.45.10/combine/configurations.
Harvesting Records¶
Static XML harvest¶
Now we’re ready to run our first Job and generate our first Records. For this QuickStart, as we have not yet configured any OAI-PMH endpoints, we can run a static XML harvest on some demo data included with Combine.
From the RecordGroup screen, near the bottom and under “Harvest”, click “Static XML”.
You will be presented with a screen to run a harvest job of static XML files from disk:
Many fields are optional – e.g. Name, Description – but we will need to tell the Harvest Job where to find the files.
First, click the tab “Filesystem”, then for the form field Location of XML files on disk:, enter the following, which points to a directory of 250 MODS files (this was created during bootstrapping):
/tmp/combine/qs/mods
Next, we need to provide an XPath query that locates each discrete record within the provided MODS file. Under the section “Locate Document”, for the form field Root XML Element, enter the following:
/mods:mods
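To illustrate what this setting does conceptually, the following sketch (not Combine’s actual harvest code; the filename is assumed, and the standard MODS namespace is declared) locates discrete records in a file with that same XPath:
from lxml import etree

# Parse one of the demo MODS files created during bootstrapping (filename assumed)
tree = etree.parse("/tmp/combine/qs/mods/mods_001.xml")

# The same XPath entered as "Root XML Element", with the MODS namespace declared
namespaces = {"mods": "http://www.loc.gov/mods/v3"}
records = tree.xpath("/mods:mods", namespaces=namespaces)

# Each matching element is treated as one discrete Record
for record in records:
    print(etree.tostring(record, pretty_print=True).decode()[:200])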
For the time being, we can ignore the section “Locate Identifier in Document” which would allow us to find a unique identifier via XPath in the document. By default, it will assign a random identifier based on a hash of the document string.
Next, we can apply some optional parameters that are present for all jobs in Combine. This looks like the following:
Different parameter types can be found under the various tabs, such as:
- Field Mapping Configuration
- Validation Tests
- Transform Identifier
- etc.
Most of these settings we can leave as default for now, but one optional parameter we’ll want to check and set for this initial job is the set of Validations to perform on the records. These can be found under the “Validation Tests” tab. If you bootstrapped the demo configurations from the steps above, you should see two options, DPLA minimum and Date checker; make sure both are checked.
Finally, click “Run Job” at the bottom.
This should return you to the RecordGroup page, where a new Job has appeared and is running under the Status column in the Job table. A static job of this size should not take long; refresh the page in 10-20 seconds and you should see the Job status switch to available.
This table represents all Jobs run for this RecordGroup, and will grow as we run some more. You may also note that the Is Valid column is red and shows False, meaning some records have failed the Validation Scenarios we optionally ran for this Job. We will return to this later.
For now, let’s continue by running an XSLT Transformation on these records.
Transforming Records¶
In the previous step, we harvested 250 records from a set of static MODS XML documents. Now, we will transform all the Records in that Job with an XSLT Transformation Scenario.
From the RecordGroup screen, click the “Transform” link at the bottom.
For a Transform job, you are presented with other Jobs from this RecordGroup that will be used as an input job for this Transformation.
Again, Job Name and Job Note are both optional. What is required is selecting which Job will serve as the input Job for this Transformation. In Combine, most Jobs take a previous Job as an input, essentially performing the current Job over all Records from the previous Job. In this way, as Records move through Jobs, you get a series of “stages” for each Record.
An input Job can be selected for this Transform Job by clicking the radio button next to the job in the table of Jobs (at this stage, we likely only have the one Harvest Job we just ran).
Next, we must select a Transformation Scenario to apply to the records from the input Job. We have a Transformation Scenario prepared for us from the QuickStart bootstrapping, but this is where you might optionally select different transforms depending on your task at hand. While only one Transformation Scenario can be applied to a single Transform job, multiple Transformation Scenarios can be prepared and saved in advance for use by all users, ready for different needs.
For our purposes here, select MODS to Service Hub profile (xslt) from the dropdown:
Once the input Job (radio button from table) and Transformation Scenario (dropdown) are selected, we are presented with the same optional parameters as we saw in the previous Harvest Job. We can leave the defaults again, double checking that the two Validation Scenarios – DPLA minimum and Date checker – are both checked under the “Validation Tests” tab.
When running Jobs, we also have the ability to select subsets of Records from input Jobs. Under the tab “Record Input Filter”, you can refine the Records that will be used in the following ways:
- Refine by Record Validity: Select Records based on their passing/failing of Validation tests
- Limit Number of Records: Select a numerical subset of Records, helpful for testing
- Refine by Mapped Fields: Most exciting, select subsets of Records based on an ElasticSearch query run against those input Jobs mapped fields
For the time being, we can leave these as default. Finally, click “Run Job” at the bottom.
Again, we are kicked back to the RecordGroup screen, and should see a Transform Job with the status running. Note: The graph on this page near the top, now with two Jobs, indicates the original Harvest Job was the input for this new Transform Job.
Transforms can take a bit longer than harvests, particularly with the additional Validation Scenarios we are running, but this is still a small job and might take anywhere from 15-30 seconds. Refresh the page until it shows the status as available.
Also of note, the Is Valid column should hopefully no longer be red, and should read True. We will look at validations in more detail, but because we ran the same Validation Scenarios on both Jobs, this suggests the XSLT transformation fixed whatever validation problems there were for the Records in the Harvest Job.
Looking at Jobs and Records¶
Now is a good time to look at the details of the jobs we have run. Let’s start by looking at the first Harvest Job we ran. Clicking the Job name in the table, or “details” link at the far-right will take you to a Job details page.
Note: Clicking the Job in the graph will gray out any other jobs in the table below that are not a) the job itself, or b) upstream jobs that served as inputs.
Job Details¶
This page provides details about a specific Job.
Major sections can be found behind the various tabs, and include:
- Records
- a table of all records contained in this Job
- Mapped Fields
- statistical breakdown of indexed fields, with ability to view values per field
- Input Jobs
- what Jobs were used as inputs for this Job
- Validation
- shows all Validations run for this Job, with reporting
- Job Type Specific Details
- depending on the Job type, details relevant to that task (e.g. Transform Jobs will show all Records that were modified)
- DPLA Bulk Data Matches
- if run and configured, shows matches with DPLA bulk data sets
Records¶
Sortable, searchable, this shows all the individual, discrete Records for this Job. This is one, but not the only, entry point for viewing the details about a single Record. It is also helpful for determining if the Record is unique with respect to other Records from this Job.
Mapped Fields¶
This table represents all mapped fields from the Record’s original source XML record to ElasticSearch.
To this point, we have been using the default configurations for mapping, but more complex mappings can be provided when running a new Job, or when re-indexing a Job. These configurations are covered in more detail in Field Mapping.
At a glance, field mapping attempts to convert XML into key/value pairs suitable for a search platform like ElasticSearch. Combine does this via a library, xml2kvp, which stands for “XML to Key/Value Pairs” and accepts a medley of configurations in JSON format. These JSON parameters are referred to as “Field Mapper Configurations” throughout.
For example, it might map the following XML block from a Record’s MODS metadata:
<mods:mods>
<mods:titleInfo>
<mods:title>Edmund Dulac's fairy-book : </mods:title>
<mods:subTitle>fairy tales of the allied nations</mods:subTitle>
</mods:titleInfo>
</mods:mods>
to the following two ElasticSearch key/value pairs:
mods|mods_mods|titleInfo_mods|title : Edmund Dulac's fairy-book :
mods|mods_mods|titleInfo_mods|subTitle : fairy tales of the allied nations
An example of a field mapping configuration that could be applied is remove_ns_prefix, which removes XML namespace prefixes from the resulting fields. This would result in the following fields, removing the mods prefix and delimiter from each field:
mods_titleInfo_title : Edmund Dulac's fairy-book :
mods_titleInfo_subTitle : fairy tales of the allied nations
It can be dizzying at a glance, but it provides a thorough and comprehensive way to analyze the breakdown of metadata field usage across all Records in a Job, with the understanding, of course, that these “flattened” fields are not shaped like the raw, potentially hierarchical XML from the Record, but nonetheless crosswalk the values in one way or another.
Clicking on the mapped ElasticSearch field name on the far left will reveal all values for that dynamically created field, across all Records. Clicking on a count from the column Document with Field will return a table of Records that have a value for that field; Document without will show Records that do not have a value for this field.
An example of how this may be helpful: sorting the column Documents without in ascending order with zero at the top, you can scroll down until you see the count 11. This represents a subset of Records – 11 of them – that do not have the field mods|mods_mods|subject_mods|topic, which might itself be helpful to know. This is particularly true with fields that might represent titles, identifiers, or other required information. At the far end of the column, we can see that 95% of Records have this field, and 34% of those have unique values.
Clicking on the button “Show field analysis explanation” will reveal some information about other columns from this table.
Note: Short of an extended discussion about this mapping and its possible value, it is worth noting that these indexed fields are used almost exclusively for analysis and for creating subsets of Records in Combine through queries, and are not any kind of final mapping or transformation on the Record itself. The Record’s XML is always stored separately in MySQL (and on disk as Avro files), and is used for any downstream transformations or publishing. The only exception is where Combine attempts to query the DPLA API to match records, which is based on these mapped fields, but more on that later.
Validation¶
This table shows all the Validation Scenarios that were run for this job, including any/all failures for each scenario.
For our example Harvest, under DPLA minimum, we can see that there were 250 Records that failed validation. For the Date checker validation, all records passed. We can click the “See Failures” link to get the specific Records that failed, with some information about which tests within that Validation Scenario they failed.
Additionally, we can click “Generate validation results report” to generate an Excel or .csv output of the validation results. From that screen, you are able to select:
- which Validation Scenarios to include in report
- any mapped fields (see below for an explanation of them) that would be helpful to include in the report as columns
More information about Validation Scenarios.
Record Details¶
Next, we can drill down one more level and view the details of an individual Record. From the Record table tab, click on the Record ID of any individual Record. At this point, you are presented with the details of that particular Record.
Similar to a Job’s details, a Record details page has tabs that house the following sections:
- Record XML
- Indexed Fields
- Record stages
- Validation
- DPLA Link
- Job Type Specific
Record XML¶
The raw XML document for this Record. Note: As mentioned, regardless of how fields are mapped in Combine to ElasticSearch, the Record’s XML or “document” is always left intact, and is used for any downstream Jobs. Combine provides mapping and analysis of Records through mapping to ElasticSearch, but the Record’s XML document is stored as plain LONGTEXT in MySQL for each Job.
Mapped Fields¶
This table shows the individual fields in ElasticSearch that were mapped from the Record’s XML metadata. This can further reveal how this mapping works: find a unique value in this table, note the Field Name, and then search for that value in the raw XML below.
This table is mostly for informational purposes, but it also provides a way to map generically mapped indexed fields from Combine to known fields in the DPLA metadata profile. This can be done with the dropdowns under the DPLA Mapped Field column.
Why is this helpful? One goal of Combine is to determine how metadata will eventually map to the DPLA profile. Short of doing the mapping that DPLA does when it harvests from a Service Hub, which includes enrichments as well, we can nonetheless try and “tether” this record on a known unique field to the version that might currently exist in DPLA already.
To do this, two things need to happen:
- register for a DPLA API key, and provide that key in /opt/combine/combine/localsettings.py for the variable DPLA_API_KEY.
- find the URL that points to your actual item (not the thumbnail) in these mapped fields in Combine, and from the DPLA Mapped Field dropdown, select isShownAt. The isShownAt field in DPLA records contains the URL that DPLA directs users back to, aka the actual item online. This is a particularly unique field to match on. If title or description are set, Combine will attempt to match on those fields as well, but isShownAt has proven to be much more accurate and reliable.
If all goes well, when you identify the indexed field in Combine that contains your item’s actual online URL, and map to isShownAt from the dropdown, the page will reload and fire a query to the DPLA API and attempt to match the record. If it finds a match, a new section will appear called “DPLA API Item match”, which contains the metadata from the DPLA API that matches this record.
This is an area still under development. Though the isShownAt field is usually very reliable for matching a Combine record to its live DPLA item counterpart, obviously it will not match if the URL has changed between harvests. Some kind of unique identifier might be even better, but there are problems there as well, a bit outside the scope of this QuickStart guide.
Record stages¶
This table represents the various “stages”, aka Jobs, this Record exists in. This is good insight into how Records move through Combine. We should see two stages of this Record in this table: one for the original Harvest Job (bolded, as that is the version of the Record we are currently looking at), and one as it exists in the “downstream” Transform Job. We could optionally click the ID column for a downstream Record, which would take us to that stage of the Record, but let’s hold off on that for now.
For any stage in this table, you may view the Record Document (raw Record XML), the associated, mapped ElasticSearch document (JSON), or click into the Job details for that Record stage.
Note: Behind the scenes, a Record’s combine_id field is used for linking across Jobs. Formerly, the record_id was used, but it became evident that the ability to transform a Record’s identifier used for publishing would be important. The combine_id is not shown in this table, but can be viewed at the top of the Record details page. These are UUID4 in format.
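For reference, an identifier in UUID4 format looks like the output of Python’s standard uuid module (illustrative only; Combine generates combine_id values internally):
import uuid

# Generate a UUID4-style identifier, e.g. '4f0c8f2a-9a1e-4a59-b9a2-0d9f3c6a7b21'
print(uuid.uuid4())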
Validation¶
This area shows all the Validation Scenarios that were run for this Job, and how this specific Record fared. In all likelihood, if you’ve been following this guide with the provided demo data, and you are viewing a Record from the original Harvest, you should see that it failed validation for the Validation Scenario DPLA minimum. The table will show a row for each rule from the Validation Scenario the Record failed, as a single Validation Scenario – schematron or python – may contain multiple rules / tests. You can click “Run Validation” to re-run and see the results of that Validation Scenario run against this Record’s XML document.
Harvest Details (Job Type Specific Details)¶
As we are looking at Records for a Harvest Job, clicking this tab will not provide much information. However, this is a good opportunity to think about how records are linked: we can look at the Transformation details for the same Record we are currently looking at.
To do this:
- Click the “Record Stages” tab
- Find the second row in the table, which is this same Record but as part of the Transformation Job, and click it
- From that new Record page, click the “Transform Details” tab
- unlike the “Harvest Details”, this provides more information, including a diff of the Record’s original XML if it has changed
Duplicating and Merging Jobs¶
This QuickStart guide won’t focus on Duplicating / Merging Jobs, but it is worth knowing this is possible. If you were to click the “Duplicate / Merge” link at the bottom of the RecordGroup page, you would be presented with a familiar Job creation screen, with one key difference: when selecting your input Jobs, the radio buttons have been replaced by checkboxes, indicating you can select multiple Jobs as input. Or, you can select a single Job as well.
The use cases are still emerging when this could be helpful, but here are a couple of examples…
Merging Jobs¶
In addition to “pulling” Jobs from one RecordGroup into another, it might also be beneficial to merge multiple Jobs into one. An example might be:
- Harvest a single group of records via an OAI-PMH set
- Perform a Transformation tailored to that group of records (Job)
- Harvest another group of records via a different OAI-PMH set
- Perform a Transformation tailored to that group of records (Job)
- Finally, Merge these two Transform Jobs into one, suitable for publishing from this RecordGroup.
Here is a visual representation of this scenario, taken directly from the RecordGroup page:
Look for duplicates in Jobs¶
A more specific case might be looking for duplicates between two Jobs. In this scenario, there were two OAI endpoints with nearly the same records, but not identical. Combine allowed for:
- Harvesting both
- Merging and looking for duplicates in the Record table
Publishing Records¶
If you’ve made it this far, at this point we have:
- Created the Organization, “Amazing University”
- Created the RecordGroup, “Fedora Repository”
- Harvested 250 Records from static XML files
- Transformed those 250 Records to meet our Service Hub profile
- thereby also fixing validation problems revealed in Harvest
- Looked at Job and Record details
Now, we may be ready to “publish” these materials from Combine for harvesting by others (e.g. DPLA).
Overview¶
Publishing is done at the RecordGroup level, giving more weight to the idea of a RecordGroup as a meaningful, intellectual group of records. When a RecordGroup is published, it can be given a “Publish Set ID”, which translates directly to an OAI-PMH set. Note: It is possible to publish multiple, distinct RecordGroups with the same publish ID, which has the effect of allowing multiple RecordGroups to be published under the same OAI-PMH set.
Combine comes with an OAI-PMH server baked in that serves all published RecordGroups via the OAI-PMH HTTP protocol.
Publishing a RecordGroup¶
To run a Publish Job and publish a RecordGroup, navigate to the RecordGroup page, and near the top click the “Publish” button inside the top-most, small table.
You will be presented with a new Job creation screen.
Near the top, there are some fields for entering information about a Publish set identifier. You can either select a previously used Publish set identifier from the dropdown, or create a new one. Remember, this will become the OAI set identifier used in the outgoing Combine OAI-PMH server.
Let’s give it a new, simple identifier: fedora, representing that this RecordGroup is a workspace for Jobs and Records from our Fedora repository.
Then, from the table below, select the Job (again, think as a stage of the same records) that will be published for this RecordGroup. Let’s select the Transformation Job that had passed all validations.
Finally, click “Publish” at the bottom.
You will be returned to the RecordGroup, and should see a new Publish Job with status running, further extending the Job “lineage” graph at the top. Publish Jobs are usually fairly quick, as they mostly copy data from the Job that served as input.
In a few seconds you should be able to refresh the page and see this Job status switch to available, indicating the publishing is complete.
Near the top, you can now see this RecordGroup is published:
Let’s confirm and see them as published records…
Viewing published records¶
From any screen, click the “Published” link at the very top in the navigation links. This brings you to a new page with some familiar looking tables.
At the very top is a section “Published Sets”. These show all RecordGroups that have been published, with the corresponding OAI set identifier. This also provides a button to unpublish a RecordGroup (also doable from the RecordGroup page).
To the right is an area that says, “Analysis.” Clicking this button will fire a new Analysis Job – which has not yet been covered, but is essentially an isolated Job that takes 1+ Jobs from any Organization, and RecordGroup, for the purpose of analysis – with all the Published Jobs automatically selected. This provides a single point of analysis for all Records published from Combine.
Below that is a table – similar to the table from a single Job’s details – showing all Records that are published, spanning all RecordGroups and OAI sets. One column of note is Unique in Published?, which indicates whether or not this Record is unique among all published Records. Note: This test is determined by checking the record_id field for published Records; if two Records are essentially the same, but have different record_ids, this will not detect that.
Below that table is the familiar “Indexed Fields” table. This table shows mapped, indexed fields in ElasticSearch for all Records across all RecordGroups published. Similar to a single Job, this can be useful for determining irregularities among published Records (e.g. a small subset of Records that don’t have an important field).
Finally, at the very bottom are some links to the actual OAI-PMH server coming out of Combine, representing four common OAI-PMH verbs:
- Identify
- basic identification of the Combine OAI-PMH server
- List Identifiers
- list OAI-PMH identifiers for all published Records
- List Records
- list full records for all published Records (primary mechanism for harvest)
- List Sets
- list all OAI-PMH sets, a direct correlation to the OAI set identifiers for each published RecordGroup
Analysis Jobs¶
From any screen, clicking the “Analysis” link at the top in the navigation links will take you to the Analysis Jobs space. Analysis Jobs are a special kind of Job in Combine, as they are meant to operate outside the workflows of a RecordGroup.
Analysis Jobs look and feel very much like Duplicate / Merge Jobs, and that’s because they share mechanisms on the back-end. When starting a new Analysis Job, by clicking the “Run new analysis job” link at the bottom of the page, you are presented with a familiar screen to run a new Job. However, you’ll notice that you can select Jobs from any RecordGroup, and multiple jobs if so desired, much like Duplicate/Merge Jobs.
An example use case may be running an Analysis Job across a handful of Jobs, in different RecordGroups, to get a sense of how fields are used. Or, running a battery of validation tests that may not relate directly to the workflows of a RecordGroup, but are helpful to see all the same.
Analysis Jobs are not shown in RecordGroups, and are not available for selection as input Jobs from any other screens; they are a bit of an island, solely for the purpose of their Analysis namesake.
Troubleshooting¶
Undoubtedly, things might go sideways! As Combine is still quite rough around some edges, here are some common gotchas you may encounter.
A Job runs, the status immediately flips to available, and it has no Records¶
The best way to diagnose why a Job may have failed, from the RecordGroup screen, is to click the “Livy Statement” link under the Monitor column. This returns the raw output from the Spark job, via Livy, which dispatches jobs to Spark.
A common error is a stale Livy connection, specifically its MySQL connection, which is revealed at the end of the Livy statement output by:
MySQL server has gone away
This can be fixed by restarting the Livy session.
Cannot start a Livy session¶
Information for diagnosing can be found in the Livy logs at /var/log/livy/livy.stderr.
Data Model¶
Overview¶
Combine’s Data Model can be roughly broken down into the following hierarchy:
Organization --> RecordGroup --> Job --> Record
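To make this hierarchy concrete, here is a minimal, illustrative sketch of the relationships in Django ORM style; these are not Combine’s actual model definitions, just an assumed outline of the one-to-many links described below:
from django.db import models

# Illustrative only: not Combine's actual models.py

class Organization(models.Model):
    # Highest level of organization, e.g. an institution
    name = models.CharField(max_length=255)

class RecordGroup(models.Model):
    # An intellectually grouped set of records, belonging to one Organization
    organization = models.ForeignKey(Organization, on_delete=models.CASCADE)
    name = models.CharField(max_length=255)

class Job(models.Model):
    # A "stage" of records (Harvest, Transform, Merge/Duplicate, Analysis) within a RecordGroup
    record_group = models.ForeignKey(RecordGroup, on_delete=models.CASCADE)
    job_type = models.CharField(max_length=50)

class Record(models.Model):
    # A single metadata record, belonging to one Job; deleted along with its Job
    job = models.ForeignKey(Job, on_delete=models.CASCADE)
    record_id = models.CharField(max_length=1024)
    document = models.TextField()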
Organization¶
Organizations are the highest level of organization in Combine. It is loosely based on a “Data Provider” in REPOX, also the highest level of hierarchy. Organizations contain Record Groups.
Combine was designed to be flexible as to where it exists in a complicated ecosystem of metadata providers and harvesters. Organizations are meant to be helpful if a single instance of Combine is used to manage metadata from a variety of institutions or organizations.
Other than a level of hierarchy, Organizations have virtually no other affordances.
We might imagine a single instance of Combine, with two Organizations:
- Foo University
- Bar Historical Society
Foo University would contain all Record Groups that pertain to Foo University. One can imagine that Foo University has a Fedora Repository, Omeka, and might even aggregate records for a small historical society or library as well, each of which would fall under the Organization.
Record Group¶
Record Groups fall under Organizations, and are loosely based on a “Data Set” in REPOX. Record Groups contain Jobs.
Record Groups are envisioned as the right level of hierarchy for a group of records that are intellectually grouped, come from the same system, or might be managed with the same transformations and validations.
From our Foo University example above, the Fedora Repository, Omeka installs, and the records from a small historical society – all managed and mediated by Foo University – might make nice, individual, distinct Record Groups.
Job¶
Jobs are contained within a Record Group, and contain Records.
This is where the model forks from REPOX, in that a Record Group can, and likely will, contain multiple Jobs. It is reasonable to also think of a Job as a stage of records.
Jobs represent Records as they move through the various stages of harvesting, sub-dividing, and transforming. In a typical Record Group, you may see Jobs that represent a harvest of records, another for transforming the records, perhaps yet another transformation, and finally a Job that is “published”. In this way, Jobs also provide an approach to versioning Records.
Imagine the record baz that comes with the harvest from Job1. Job2 is then a transformation-style Job that uses Job1 as input. Job3 might be another transformation, and Job4 a final publishing of the records. In each Job, the record baz exists, at those various stages of harvesting and transformation. Combine errs on the side of duplicating data in the name of lineage and transparency as to how and why a Record “downstream” looks the way it does.
As may be clear by this point, Jobs are used as input for other Jobs. Job1 serves as the input Records for Job2, Job2 for Job3, etc.
There are four primary types of Jobs:
- Harvests
- Transformations
- Merge / Duplicate
- Analysis
It is up to the user how to manage Jobs in Combine, but one strategy might be to leave previous harvests, transforms, and merges of Jobs within a RecordGroup for historical purposes. From an organizational standpoint, this may look like:
Harvest, 1/1/2017 --> Transform to Service Hub Profile
Harvest, 4/1/2017 --> Transform to Service Hub Profile
Harvest, 8/1/2017 --> Transform to Service Hub Profile
Harvest, 1/1/2018 --> Transform to Service Hub Profile (Published)
In this scenario, this Record Group would have 9 total Jobs, but only the last “set” of Jobs would represent the currently published Records.
Harvest Jobs¶
Harvest Jobs are how Records are initially created in Combine. This might be through OAI-PMH harvesting, or loading from static files.
As the creator of Records, Harvest Jobs do not have input Jobs.
Transformation Jobs¶
Transformation Jobs, unsurprisingly, transform the Records in some way! Currently, XSLT and python code snippets are supported.
Transformation Jobs allow a single input Job, and are limited to Jobs within the same RecordGroup.
Merge / Duplicate Jobs¶
Merge / Duplicate Jobs are true to their namesake: merging Records across multiple Jobs, or duplicating all Records from a single Job, into a new, single Job.
Analysis Jobs¶
Analysis Jobs are Merge / Duplicate Jobs in nature, but exist outside of the normal Organization --> Record Group hierarchy. Analysis Jobs are meant as ephemeral, disposable, one-off Jobs for analysis purposes only.
Record¶
The most granular level of hierarchy in Combine is a single Record. Records are part of Jobs.
A Record’s actual XML content, and other attributes, are recorded in MongoDB, while its indexed fields are stored in ElasticSearch.
Identifiers¶
Additionally, Records have three important identifiers:
- Database ID id (integer)
  - This is the ObjectID in MongoDB, unique for all Records
- Combine ID combine_id (string)
  - this is randomly generated for a Record on creation, is what allows for linking of Records across Jobs, and is unique for all Records
- Record ID record_id (string)
  - not necessarily unique for all Records, this identifier is used for publishing
  - in the case of OAI-PMH harvesting, this is likely populated from the OAI identifier that the Record came in with
  - this can be modified with a Record Identifier Transform when run with a Job
Why the need to transform identifiers?¶
Imagine the following scenario:
Originally, there were multiple REPOX instances in play for a series of harvests and transforms. With each OAI “hop”, the identifier for a Record is prefixed with information about that particular REPOX instance.
Now, with a single instance of Combine replacing multiple REPOX instances and OAI “hops”, records that are harvested are missing pieces of the identifier that were previously created along the way.
Or, insert a myriad of other reasons why an identifier may drift or change.
Combine allows for the creation of Record Identifier Transformation Scenarios that allow for the modification of the record_id. This allows for the emulation of previous configurations or ecosystems, or optionally creating Record Identifiers – what is used for publishing – based on information from the Record’s XML document, with XPath or python code snippets.
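As a purely hypothetical illustration of the kind of python snippet such a scenario might contain (the function name and interface here are assumptions for illustration, not Combine’s actual Record Identifier Transformation API), consider re-adding a REPOX-style prefix:
# Hypothetical example only: the real Record Identifier Transformation interface in
# Combine may expect a different function name, arguments, or return convention.
def transform_record_id(record_id):
    # Emulate an upstream OAI "hop" by prefixing the identifier, as a REPOX
    # instance might have done before Combine replaced it
    prefix = "urn:repox:amazing_university:"
    return prefix + record_id

# Example usage
print(transform_record_id("oai:fedora:abc123"))  # urn:repox:amazing_university:oai:fedora:abc123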
Spark and Livy¶
Combine was designed to provide a single point of interaction for metadata harvesting, transformation, analysis, and publishing. Another guiding factor was a desire to utilize DPLA’s Ingestion 3 codebase where possible, which itself, uses Apache Spark for processing large numbers of records. The decision to use Ingestion 3 drove the architecture of Combine to use Apache Spark as the primary, background context and environment for processing Records.
This is well and good from a command line, issuing individual tasks to be performed, but how would this translate to a GUI that could be used to initiate tasks, queue them, and view the results? It became evident that an intermediary piece was needed to facilitate running Spark “jobs” from a request/response oriented front-end GUI. Apache Livy was suggested as just such a piece, and fit the bill perfectly. Livy allows for the submission of jobs to a running Spark context via JSON, and the subsequent ability to “check” on the status of those jobs.
As Spark natively allows python code as a language for submitting jobs, Django was chosen as the front-end framework for Combine, to have some parity between the language of the GUI front-end and the language of the actual code submitted to Spark for batch processing records.
This all conspires to make Combine relatively fast and efficient, but adds a level of complexity. When Jobs are run in Combine, they are submitted to this running, background Spark context via Livy. While Livy is utilized in a similar fashion at scale for large enterprise systems, it is often obfuscated from users and the front-end GUI. This is partially the case for Combine.
Livy Sessions¶
Livy creates Spark contexts that can receive jobs via what it calls “sessions”. In Combine, only one active Livy session is allowed at a time. This is partially for performance reasons, to avoid gobbling up all of the server’s resources, and partially to enforce a sequential running of Spark Jobs that avoids many of the complexities that would be introduced if Jobs – that require input from the output of one another – were finishing at different times.
Manage Livy Sessions¶
Navigate to the “System” link at the top-most navigation of Combine. If no Livy sessions are found or active, you will be presented with a screen that looks like this:
To begin a Livy session, click “Start New Livy Session”. The page will refresh and you should see a screen that shows the Livy session is starting:
After 10-20 seconds, the page can be refreshed and it should show the Livy session as idle, meaning it is ready to receive jobs:
Barring any errors with Livy, this is the only interaction with Livy that a Combine user needs to concern themselves with. If this Livy session grows stale, or is lost, Combine will attempt to restart it automatically when it’s needed. This will actually remove the session and begin a new one, but this should remain invisible to the casual user.
However, a more advanced user may choose to remove an active Livy session from Combine from this screen. When this happens, Combine cannot automatically refresh the Livy connection when needed, and all work requiring Spark will fail. To begin using Livy/Spark again, a new Livy session will need to be manually started per the instructions above.
Configuration¶
Combine relies heavily on front-loading configuration, so that the process of running Jobs is largely selecting pre-existing “scenarios” that have already been tested and configured.
This section will outline configuration options and associated configuration pages.
Note: Currently, Combine leverages Django’s built-in admin interface for editing and creating model instances – transformations, validations, and other scenarios – below. This will likely evolve into more tailored CRUDs for each, but for the time being, there is a link to the Django admin panel on the Configuration screen.
Note: Settings that are not configurable via the GUI in Combine are configurable in the file combine/localsettings.py.
Field Mapper Configurations¶
Field Mapping is the process of mapping values from a Record’s source document (likely XML) to meaningful and analyzable key/value pairs that can be stored in ElasticSearch. These mapped values from a Record’s document are used in Combine for:
- analyzing distribution of XML elements and values across Records
- exporting to mapped field reports
- for single Records, querying the DPLA API to check existence
- comparing Records against DPLA bulk data downloads
- and much more!
To perform this mapping, Combine uses an internal library called XML2kvp, which stands for “XML to Key/Value Pairs”, to map XML to key/value JSON documents. Under the hood, XML2kvp uses xmltodict to parse the Record XML into a hierarchical dictionary, and then loops through that, creating fields based on the configurations below.
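As a rough sketch of this general approach (a simplified illustration, not Combine’s actual XML2kvp code, and ignoring attributes, namespaces, and most of the configurations described below), flattening a parsed xmltodict dictionary into delimited key/value pairs might look like:
import xmltodict

def flatten(node, path="", delim="_", fields=None):
    # Recursively walk the nested dict from xmltodict, joining element names
    # with a delimiter to build flat field names
    if fields is None:
        fields = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flatten(value, f"{path}{delim}{key}" if path else key, delim, fields)
    elif isinstance(node, list):
        for item in node:
            flatten(item, path, delim, fields)
    elif node is not None:
        fields.setdefault(path, []).append(node)
    return fields

xml = "<root><foo><bar>42</bar></foo><foo><bar>9393943</bar></foo></root>"
print(flatten(xmltodict.parse(xml)))
# {'root_foo_bar': ['42', '9393943']}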
I’ve mapped DC or MODS to Solr or ElasticSearch, why not do something similar?¶
Each mapping is unique, supporting different access, preservation, or analysis purposes. A finely tuned mapping for one metadata format or institution might be unusable for another, even for the same metadata format. Combine strives to be metadata format agnostic for harvesting, transformation, and analysis, and furthermore, to perform these actions before a mapping has even been created or considered. To this end, a “generic” but customizable mapper was needed to take XML records and convert them into fields that can be used for developing an understanding about a group of Records.
While applications like Solr and ElasticSearch more recently support hierarchical documents, and would likely support a straight XML to JSON converted document (with xmltodict, or Object Management Group (OMG)’s XML to JSON conversion standard), the attributes in XML give it a dimensionality beyond simple hierarchy, and can be critical to understanding the nature and values of a particular XML element. These direct mappings would function, but would not provide the same scannable analysis of a group of XML records.
XML2kvp provides a way to blindly map most any XML document, providing a broad overview of fields and structures, with the ability to further narrow and configure. A possible update/improvement would be the ability for users to upload mappers of their making (e.g. XSLT) that would result in a flat mapping, but that is currently not implemented.
How does it work¶
XML2kvp converts elements from XML to key/value pairs by converting hierarchy in the XML document to character delimiters.
Take for example the following, “unique” XML:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns:internet="http://internet.com">
<foo>
<bar>42</bar>
<baz>109</baz>
</foo>
<foo>
<bar>42</bar>
<baz>109</baz>
</foo>
<foo>
<bar>9393943</bar>
<baz>3489234893</baz>
</foo>
<tronic type='tonguetwister'>Sally sells seashells by the seashore.</tronic>
<tronic type='tonguetwister'>Red leather, yellow leather.</tronic>
<tronic>You may disregard</tronic>
<goober scrog='true' tonk='false'>
<depths>
<plunder>Willy Wonka</plunder>
</depths>
</goober>
<nested_attribs type='first'>
<another type='second'>paydirt</another>
</nested_attribs>
<nested>
<empty></empty>
</nested>
<internet:url url='http://example.com'>see my url</internet:url>
<beat type="4/4">four on the floor</beat>
<beat type="3/4">waltz</beat>
<ordering>
<duck>100</duck>
<duck>101</duck>
<goose>102</goose>
<it>run!</it>
</ordering>
<ordering>
<duck>200</duck>
<duck>201</duck>
<goose>202</goose>
<it>run!</it>
</ordering>
</root>
Converted with default options from XML2kvp, you would get the following key/value pairs in JSON form:
{'root_beat': ('four on the floor', 'waltz'),
'root_foo_bar': ('42', '9393943'),
'root_foo_baz': ('109', '3489234893'),
'root_goober_depths_plunder': 'Willy Wonka',
'root_nested_attribs_another': 'paydirt',
'root_ordering_duck': ('100', '101', '200', '201'),
'root_ordering_goose': ('102', '202'),
'root_ordering_it': 'run!',
'root_tronic': ('Sally sells seashells by the seashore.',
'Red leather, yellow leather.',
'You may disregard'),
'root_url': 'see my url'}
Some things to notice…
- the XML root element
<root>
is present for all fields asroot
- the XML hierarchy
<root><foo><bar>
repeats twice in the XML, but is collapsed into a single fieldroot_foo_bar
- moreover, because
skip_repeating_values
is set totrue
, the value42
shows up only once, if set tofalse
we would see the value('42', '42', '9393943')
- a distinct absence of all attributes from the original XML, this is because
include_all_attributes
is set tofalse
by default.
Running with include_all_attributes set to true, we see a more complex and verbose output, with @ in various field names, indicating attributes:
{'root_beat_@type=3/4': 'waltz',
'root_beat_@type=4/4': 'four on the floor',
'root_foo_bar': ('42', '9393943'),
'root_foo_baz': ('109', '3489234893'),
'root_goober_@scrog=true_@tonk=false_depths_plunder': 'Willy Wonka',
'root_nested_attribs_@type=first_another_@type=second': 'paydirt',
'root_ordering_duck': ('100', '101', '200', '201'),
'root_ordering_goose': ('102', '202'),
'root_ordering_it': 'run!',
'root_tronic': 'You may disregard',
'root_tronic_@type=tonguetwister': ('Sally sells seashells by the seashore.',
'Red leather, yellow leather.'),
'root_url_@url=http://example.com': 'see my url'}
A more familiar example may be Dublin Core XML:
<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns="http://www.openarchives.org/OAI/2.0/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Fragments of old book</dc:title>
<dc:creator>Unknown</dc:creator>
<dc:date>1601</dc:date>
<dc:description>An object of immense cultural and historical worth</dc:description>
<dc:subject>Writing--Materials and instruments</dc:subject>
<dc:subject>Archaeology</dc:subject>
<dc:coverage>1600-1610</dc:coverage>
<dc:identifier>book_1234</dc:identifier>
</oai_dc:dc>
And with default configurations, would map to:
{'dc_coverage': '1600-1610',
'dc_creator': 'Unknown',
'dc_date': '1601',
'dc_description': 'An object of immense cultural and historical worth',
'dc_identifier': 'book_1234',
'dc_subject': ('Writing--Materials and instruments', 'Archaeology'),
'dc_title': 'Fragments of old book'}
Configurations¶
Within Combine, the configurations passed to XML2kvp are referred to as “Field Mapper Configurations”, and like many other parts of Combine, can be named, saved, and updated in the database for later, repeated use. The following table describes the configurations that can be used for field mapping.
Parameter | Type | Description |
---|---|---|
add_literals |
object |
Key/value pairs for literals to mixin, e.g. foo :bar would create field foo with value bar [Default: {} ] |
capture_attribute_values |
array |
Array of attributes to capture values from and set as standalone field, e.g. if [age ] is provided and encounters <foo age='42'/> , a field foo_@age@ would be created (note the additional trailing @ to indicate an attribute value) with the value 42 . [Default: [] , Before: copy_to , copy_to_regex ] |
concat_values_on_all_fields |
[boolean ,``string``] |
Boolean or String to join all values from multivalued field on [Default: false ] |
concat_values_on_fields |
object |
Key/value pairs for fields to concat on provided value, e.g. foo_bar :- if encountering foo_bar :[goober ,``tronic``] would concatenate to foo_bar :goober-tronic [Default: {} ] |
copy_to_regex |
object |
Key/value pairs to copy one field to another, optionally removing original field, based on regex match of field, e.g. .*foo :bar would copy create field bar and copy all values fields goober_foo and tronic_foo to bar . Note: Can also be used to remove fields by setting the target field as false, e.g. .*bar :false , would remove fields matching regex .*bar [Default: {} ] |
copy_to |
object |
Key/value pairs to copy one field to another, optionally removing original field, e.g. foo :bar would create field bar and copy all values when encountered for foo to bar , removing foo . However, the original field can be retained by setting remove_copied_key to true . Note: Can also be used to remove fields by setting the target field as false, e.g. ‘foo’:false , would remove field foo . [Default: {} ] |
- `copy_value_to_regex` (object): Key/value pairs that match values based on regex and copy them to a new field if matching, e.g. `http.*:websites` would create a new field `websites` and copy `http://example.com` and `https://example.org` to it [Default: `{}`]
- `error_on_delims_collision` (boolean): Raise a `DelimiterCollision` exception if delimiter strings from either `node_delim` or `ns_prefix_delim` collide with a field name or field value (false by default for permissive mapping, but can be helpful if collisions are essential to detect) [Default: `false`]
- `exclude_attributes` (array): Attributes to skip when creating field names, e.g. `[baz]` when encountering XML `<foo><bar baz='42' goober='1000'>tronic</bar></foo>` would create field `foo_bar_@goober=1000`, skipping attribute `baz` [Default: `[]`]
- `exclude_elements` (array): Elements to skip when creating field names, e.g. `[baz]` when encountering `<foo><baz><bar>tronic</bar></baz></foo>` would create field `foo_bar`, skipping element `baz` [Default: `[]`, After: `include_all_attributes`, `include_attributes`]
- `include_all_attributes` (boolean): Consider and include all attributes when creating field names, e.g. if `false`, the XML `<foo><bar baz='42' goober='1000'>tronic</bar></foo>` results in field name `foo_bar` with no attributes included. Note: using all attributes for field names has the potential to balloon rapidly and hit the ElasticSearch field limit for an index, hence `false` by default. [Default: `false`, Before: `include_attributes`, `exclude_attributes`]
- `include_attributes` (array): Attributes to include when creating field names, despite the setting of `include_all_attributes`, e.g. `[baz]` when encountering XML `<foo><bar baz='42' goober='1000'>tronic</bar></foo>` would create field `foo_bar_@baz=42` [Default: `[]`, Before: `exclude_attributes`, After: `include_all_attributes`]
- `include_meta` (boolean): Include an `xml2kvp_meta` field with the output that contains all of these configurations [Default: `false`]
- `node_delim` (string): String to use as the delimiter between XML elements and attributes when creating a field name, e.g. `___` will convert the XML `<foo><bar>tronic</bar></foo>` to field name `foo___bar` [Default: `_`]
- `ns_prefix_delim` (string): String to use as the delimiter between XML namespace prefixes and elements, e.g. `|` for the XML `<ns:foo><ns:bar>tronic</ns:bar></ns:foo>` will create field name `ns|foo_ns|bar`. Note: a `|` pipe character is used to avoid colons in ElasticSearch field names, which can be problematic. [Default: `|`]
- `remove_copied_key` (boolean): Determines whether the originating field is removed from output when that field is copied to another field [Default: `true`]
- `remove_copied_value` (boolean): Determines whether a value is removed from the originating field when that value is copied to another field [Default: `false`]
- `remove_ns_prefix` (boolean): Determines whether XML namespace prefixes are removed from field names, e.g. when prefixes are removed, the XML `<ns:foo><ns:bar>tronic</ns:bar></ns:foo>` results in field name `foo_bar` with no `ns` prefix [Default: `true`]
- `self_describing` (boolean): Include machine-parsable information about the delimiters used (reading right to left, each delimiter and its length in characters) as a suffix to the field name, e.g. if `true`, with a `node_delim` of `___` and an `ns_prefix_delim` of `|`, the suffix will be `___3|1`. Can be useful for reverse engineering a field name when it is not re-parsed by XML2kvp. [Default: `false`]
- `skip_attribute_ns_declarations` (boolean): Remove namespace declarations, which would otherwise be treated as attributes, when creating field names [Default: `true`]
- `skip_repeating_values` (boolean): Determines whether repeated values in a multivalued field are skipped, e.g. when repeats are skipped, the XML `<foo><bar>42</bar><bar>42</bar></foo>` maps to `foo_bar:42`, removing the repeating instance of that value [Default: `true`]
- `skip_root` (boolean): Determines whether the XML root element is included in output field names [Default: `false`]
- `split_values_on_all_fields` (boolean or string): If present, the string to use for splitting values from all fields, e.g. a single space will convert the single value `a foo bar please` into the array of values `[a, foo, bar, please]` for that field [Default: `false`]
- `split_values_on_fields` (object): Key/value pairs of field names to split and the string to split on, e.g. `foo_bar:,` will split all values on field `foo_bar` on the comma `,` [Default: `{}`]
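As a quick illustration of how a couple of these parameters interact (the field and delimiter values here are hypothetical, drawn from the examples above), a configuration overriding only `node_delim` and `split_values_on_fields` might look like the following. Per the descriptions above, `<foo><bar>tronic</bar></foo>` would then map to the field `foo___bar`, and any comma-delimited values in that field would be split into multiple values:

{
  "node_delim": "___",
  "split_values_on_fields": { "foo___bar": "," }
}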
Saving and Reusing¶
Field Mapper configurations may be saved, named, and re-used. This can be done anytime field mapper configurations are being set, e.g. when running a new Job, or re-indexing a previously run Job.
Testing¶
Field Mapping can also be tested against a single record, accessible from a Record’s page under the “Run/Test Scenarios for this Record” tab. The following is a screenshot of this testing page:
In this screenshot, you can see a single Record used as input, a Field Mapper Configuration applied, and the resulting mapped fields at the bottom.
OAI Server Endpoints¶
Configuring OAI endpoints is the first step for harvesting from OAI endpoints.
To configure a new OAI endpoint, navigate to the Django admin screen and, under the section “Core”, select Oai endpoints.
This model is unique among other Combine models in that these values are sent almost untouched to the DPLA Ingestion 3 OAI harvesting codebase. More information on these fields can be found here.
The following fields are all required:
- `Name` - Human readable name for the OAI endpoint, used in the dropdown menu when running a harvest
- `Endpoint` - URL for the OAI server endpoint. This should include the full URL up until, but not including, GET parameters that begin with a question mark.
- `Verb` - The OAI-PMH verb that will be used for harvesting. Almost always, `ListRecords` is the required verb here; so much so that it will default to `ListRecords` if left blank.
- `MetadataPrefix` - Another OAI-PMH term, the metadata prefix that will be used during harvesting.
- `Scope type` - Not an OAI term, this refers to what kind of harvesting should be performed. Possible values include:
  - `setList` - Harvests the comma separated sets provided for `Scope value`.
  - `harvestAllSets` - The most performant option, this harvests all sets from the OAI endpoint. If this is set, the `Scope value` field must be set to `true`.
  - `blacklist` - Comma separated list of OAI sets to exclude from harvesting.
- `Scope value` - String used in conjunction with the `Scope type` outlined above.
  - If `setList` is used, provide a comma separated string of OAI sets to harvest.
  - If `harvestAllSets` is used, provide just the single string `true`.
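For example, a hypothetical endpoint configuration (all values below are illustrative, not defaults) might look like:

- Name: `Example Repository`
- Endpoint: `http://example.org/oai` (note: the URL stops before any `?verb=...` GET parameters)
- Verb: `ListRecords`
- MetadataPrefix: `mods`
- Scope type: `setList`
- Scope value: `set1,set2`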
Once the OAI endpoint has been added in the Django admin, you are presented with a table on the Configuration page showing all configured OAI endpoints. The last column includes a link to view all OAI sets from that endpoint.
Transformation Scenario¶
Transformation Scenarios are used for transforming the XML of Records during Transformation Jobs. Currently, there are two well-supported transformation types: XSLT and Python code snippets. A third type, transforming Records based on actions performed in Open Refine, exists but is not well tested or documented at this time. These are described in more detail below.
It is worth considering, when thinking about transforming Records in Combine, that multiple transformations can be applied to the same Record, “chained” together as separate Jobs. Imagine a scenario where Transformation `A` crosswalks metadata from a repository to something more aligned with a state service hub, Transformation `B` fixes some particular date formats, and Transformation `C` – a python transformation – looks for a particular identifier field and creates a new field based on that. Each of these would be a separate Transformation Scenario, and would be run as separate Jobs in Combine, but in effect they would be “chained” together by the user for a group of Records.
All Transformations require the following information:
- `Name` - Human readable name for the Transformation Scenario
- `Payload` - Where the actual transformation code is added (more on the different types below)
- `Transformation Type` - `xslt` for XSLT transformations, or `python` for python code snippets
- `Filepath` - May be ignored (in some cases, transformation payloads were written to disk to be used, but this is likely deprecated moving forward)
Finally, Transformation Scenarios may be tested within Combine against a pre-existing Record. This is done by clicking the “Test Transformation Scenario” button from the Configuration page. This will take you to a screen that is similarly used for testing Transformations, Validations, and Record Identifier Transformations. For Transformations, it looks like the following:
In this screenshot, a few things are happening:
- a single Record has been clicked from the sortable, searchable table, indicating it will be used for the Transformation testing
- a pre-existing Transformation Scenario has been selected from the dropdown menu, automatically populating the payload and transformation type inputs
- however, a user may also add or edit the payload and transformation types live here, for testing purposes
- at the very bottom, you can see the immediate results of the Transformation as applied to the selected Record
Currently, there is no way to save changes to a Transformation Scenario, or add a new one, from this screen, but it allows for real-time testing of Transformation Scenarios.
XSLT¶
XSLT transformations are performed by a small XSLT processor servlet called via pyjxslt. Pyjxslt uses a built-in Saxon HE XSLT processor that supports XSLT 2.0.
When creating an XSLT Transformation Scenario, one important thing to consider is XSLT includes and imports. XSL stylesheets allow the inclusion of other, external stylesheets. Usually, these includes come in two flavors:
- locally, on the same filesystem, e.g. `<xsl:include href="mimeType.xsl"/>`
- remote, retrieved via HTTP request, e.g. `<xsl:include href="http://www.loc.gov/standards/mods/inc/mimeType.xsl"/>`
In Combine, the primary XSL stylesheet provided for a Transformation Scenario is uploaded to the pyjxslt servlet to be run by Spark. This has the effect of breaking `<xsl:include>`s that use local, filesystem `href`s. Additionally, depending on server configurations, pyjxslt sometimes has trouble accessing remote includes. Combine provides workarounds for both scenarios.
Local Includes¶
For XSL stylesheets that require local, filesystem includes, a workaround in Combine is to create a Transformation Scenario for each XSL stylesheet that is imported by the primary stylesheet. Then, use the local filesystem path that Combine creates for that Transformation Scenario, and update the `<xsl:include>` in the original stylesheet with this new location on disk.
For example, let’s imagine a stylesheet called `DC2MODS.xsl` that has the following `<xsl:include>`s:
<xsl:include href="dcmiType.xsl"/>
<xsl:include href="mimeType.xsl"/>
Originally, `DC2MODS.xsl` was designed to be used in the same directory as two files: `dcmiType.xsl` and `mimeType.xsl`. This is not possible in Combine, as XSL stylesheets for Transformation Scenarios are uploaded to another location to be used.
The workaround would be to create two new, special kinds of Transformation Scenarios by checking the box `use_as_include`, perhaps with fitting names like “dcmiType” and “mimeType”, that have payloads for those two stylesheets. When creating those Transformation Scenarios, saving, and then re-opening them in the Django admin, you can see that a `Filepath` attribute has been generated, pointing to a copy written to disk.
This `Filepath` value can then be used to replace the original `<xsl:include>`s in the primary stylesheet, in our example, `DC2MODS.xsl`:
<xsl:include href="/home/combine/data/combine/transformations/a436a2d4997d449a96e008580f6dc699.xsl"/> <!-- formerly dcmiType.xsl -->
<xsl:include href="/home/combine/data/combine/transformations/00eada103f6a422db564a346ed74c0d7.xsl"/> <!-- formerly mimeType.xsl -->
Remote Includes¶
When the `href`s for XSL includes are remote HTTP URLs, Combine attempts to rewrite the primary XSL stylesheet automatically by:

- downloading the external, remote includes from the primary stylesheet
- saving them locally
- rewriting the `<xsl:include>` elements with this local filesystem location
This has the added advantage of effectively caching the remote include, such that it is not retrieved on each transformation.
For example, let’s imagine our trusty stylesheet `DC2MODS.xsl`, but this time with external, remote URLs for the `href`s:
<xsl:include href="http://www.loc.gov/standards/mods/inc/dcmiType.xsl"/>
<xsl:include href="http://www.loc.gov/standards/mods/inc/mimeType.xsl"/>
With no action by the user, when this Transformation Scenario is saved, Combine will attempt to download these dependencies and rewrite the stylesheet, resulting in includes that look like the following:
<xsl:include href="/home/combine/data/combine/transformations/dcmiType.xsl"/>
<xsl:include href="/home/combine/data/combine/transformations/mimeType.xsl"/>
Note: If the external stylesheets that a remote include points to change or update, the primary Transformation stylesheet – e.g. `DC2MODS.xsl` – will have to be re-entered with the original URLs and re-saved in Combine to update the local dependencies.
Python Code Snippet¶
An alternative to XSLT transformations is to create Transformation Scenarios that use python code snippets to transform the Record. The key to a successful python Transformation Scenario is code that adheres to the pattern Combine expects from a python Transformation. This requires a bit of explanation about how Records are transformed in Spark.
For Transformation Jobs in Combine, each Record in the input Job is fed to the Transformation Scenario. If the transformation type is `xslt`, the XSLT stylesheet for that Transformation Scenario is used as-is on the Record’s raw XML. However, if the transformation type is `python`, the python code provided for the Transformation Scenario will be used.
The python code snippet may include as many imports or function definitions as needed, but it must include one function that each Record will be passed to, and this function must be named `python_record_transformation`. Additionally, this function must expect one argument: a passed instance of what is called a PythonUDFRecord. In Spark, “UDF” often refers to a “User Defined Function”, which is precisely what this parsed Record instance is passed to in the case of a Transformation. This is a convenience class that parses a Record in Combine for easy interaction within Transformation, Validation, and Record Identifier Transformation Scenarios. A `PythonUDFRecord` instance has the following representations of the Record:
- `record_id` - the Record Identifier of the Record
- `document` - raw XML for the Record (what is passed to XSLT transformations)
- `xml` - raw XML parsed with lxml’s etree, an `ElementTree` instance
- `nsmap` - dictionary of namespaces, useful for working with the `xml` instance
Finally, the function `python_record_transformation` must return a python list with the following, ordered elements: [ transformed XML as a string, any errors if they occurred as a string, True/False for successful transformation ]. For example, a valid return might be the following, with the middle value a blank string indicating no error:
[ "<xml>....</xml>", "", True ]
A full example of a python code snippet transformation might look like the following. In this example, a `<mods:accessCondition>` element is added or updated. Note the imports, the comments, and the use of the `PythonUDFRecord` as the single argument for the function `python_record_transformation`, all fairly commonplace python code:
# NOTE: ability to import libraries as needed
from lxml import etree

def python_record_transformation(record):

    '''
    Python transformation to add / update <mods:accessCondition> element
    '''

    # check for <mods:accessCondition type="use and reproduction">
    # NOTE: record.xml is the built-in, parsed Record document as an etree instance
    # NOTE: record.nsmap is the built-in namespace dictionary that comes with the record instance
    ac_ele_query = record.xml.xpath('mods:accessCondition', namespaces=record.nsmap)

    # if single <mods:accessCondition> present
    if len(ac_ele_query) == 1:

        # get single instance
        ac_ele = ac_ele_query[0]

        # confirm type attribute
        if 'type' in ac_ele.attrib.keys():

            # if present, but not 'use and reproduction', update
            if ac_ele.attrib['type'] != 'use and reproduction':
                ac_ele.attrib['type'] = 'use and reproduction'

    # if <mods:accessCondition> not present at all, create
    elif len(ac_ele_query) == 0:

        # build element
        rights = etree.Element('{http://www.loc.gov/mods/v3}accessCondition')
        rights.attrib['type'] = 'use and reproduction'
        rights.text = 'Here is a blanket rights statement for our institution in the absence of a record specific one.'

        # append
        record.xml.append(rights)

    # finally, serialize and return as required list [document, error, success (bool)]
    return [etree.tostring(record.xml), '', True]
In many if not most cases, XSLT will fit the bill and provide the needed transformation in Combine. But the ability to write python code for transformation opens up the door to complex and/or precise transformations if needed.
Validation Scenario¶
Validation Scenarios are what Records in Combine are validated against. Validation Scenarios may be written in the following formats: XML Schema (XSD), Schematron, Python code snippets, and ElasticSearch DSL queries. Each Validation Scenario requires the following fields:
- `Name` - human readable name for the Validation Scenario
- `Payload` - pasted Schematron or python code
- `Validation type` - `sch` for Schematron, `python` for python code snippets, or `es_query` for ElasticSearch DSL query type validations
- `Filepath` - may be ignored (in some cases, validation payloads were written to disk to be used, but this is likely deprecated moving forward)
- `Default run` - if checked, this Validation Scenario will be automatically checked when running a new Job
When running a Job, multiple Validation Scenarios may be applied, each of which will run for every Record. Validation Scenarios may include multiple tests or “rules” within a single scenario. So, for example, `Validation A` may contain `Test 1` and `Test 2`. If run for a Job, and Record `Foo` fails `Test 2` of `Validation A`, the results will show the failure for that Validation Scenario as a whole.
When thinking about creating Validation Scenarios, there is flexibility in how many tests to put in a single Validation Scenario versus splitting those tests across distinct Validation Scenarios, recalling that multiple Validation Scenarios may be run for a single Job. It is worth pointing out that multiple Validation Scenarios for a Job will likely degrade performance more than multiple tests within a single Scenario, though this has not been tested thoroughly; it is speculation based on how Records are passed to Validation Scenarios in Spark.
Like Transformation Scenarios, Validation Scenarios may also be tested in Combine. This is done by clicking the button, “Test Validation Scenario”, resulting in the following screen:
In this screenshot, we can see the following happening:
- a single Record has been clicked from the sortable, searchable table, indicating it will be used for the Validation testing
- a pre-existing Validation Scenario – `DPLA minimum`, a Schematron validation – has been selected, automatically populating the payload and validation type inputs
- however, a user may choose to edit or input their own validation payload here, understanding that editing and saving cannot currently be done from this screen, only testing
- results are shown at the bottom in two areas:
  - `Parsed Validation Results` - parsed results of the Validation, showing tests that have passed, failed, and a total count of failures
  - `Raw Validation Results` - raw results of the Validation Scenario, in this case XML from the Schematron response, but this would be a JSON string for a python code snippet Validation Scenario
As mentioned, four types of Validation Scenarios are currently supported – XML Schema, Schematron, python code snippets, and ElasticSearch DSL queries – and each is detailed below.
XML Schema (XSD)¶
XML Schemas (XSD) may be used to validate a Record’s document. One limitation of XML Schema is that many python based validators will bail on the first error encountered in a document, meaning the resulting Validation failure will only show the first invalid XML segment encountered, though there may be many. However, knowing that a Record has failed even one part of an XML Schema might be sufficient to look in more detail with an external validator and determine where else it is invalid, or to fix that problem through a transform or re-harvest and continue to run the XML Schema validations.
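For reference, the payload for an XSD Validation Scenario is presumably the schema document itself. Below is a minimal, hypothetical sketch (a contrived schema requiring a single title element); in practice the payload would more likely be a full schema such as the MODS XSD:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- hypothetical root element with one required child -->
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>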
Schematron¶
A valid Schematron XML document may be used as the Validation Scenario payload, and will validate the Record’s raw XML. Schematron validations are rule-based, and can be configured to return the validation results as XML, which is the case in Combine. This XML is parsed, and each distinct, defined test is noted and parsed by Combine.
Below is an example of a small Schematron validation that looks for some required fields in an XML document that would help make it DPLA compliant:
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron" xmlns:mods="http://www.loc.gov/mods/v3">
<ns prefix="mods" uri="http://www.loc.gov/mods/v3"/>
<!-- Required top level Elements for all records record -->
<pattern>
<title>Required Elements for Each MODS record</title>
<rule context="mods:mods">
<assert test="mods:titleInfo">There must be a title element</assert>
<assert test="count(mods:location/mods:url[@usage='primary'])=1">There must be a url pointing to the item</assert>
<assert test="count(mods:location/mods:url[@access='preview'])=1">There must be a url pointing to a thumnail version of the item</assert>
<assert test="count(mods:accessCondition[@type='use and reproduction'])=1">There must be a rights statement</assert>
</rule>
</pattern>
<!-- Additional Requirements within Required Elements -->
<pattern>
<title>Subelements and Attributes used in TitleInfo</title>
<rule context="mods:mods/mods:titleInfo">
<assert test="*">TitleInfo must contain child title elements</assert>
</rule>
<rule context="mods:mods/mods:titleInfo/*">
<assert test="normalize-space(.)">The title elements must contain text</assert>
</rule>
</pattern>
<pattern>
<title>Additional URL requirements</title>
<rule context="mods:mods/mods:location/mods:url">
<assert test="normalize-space(.)">The URL field must contain text</assert>
</rule>
</pattern>
</schema>
Python Code Snippet¶
Similar to Transformation Scenarios, python code may also be used for a Validation Scenario’s payload. When a Validation is run for a Record, and a python code snippet type is detected, all defined function names that begin with `test_` will be used as separate, distinct Validation tests. This is very similar to how pytest looks for function names prefixed with `test_`. It is not perfect, but relatively simple and effective.
These functions must expect two arguments. The first is an instance of a PythonUDFRecord. As detailed above, `PythonUDFRecord` instances are a parsed, convenient way to interact with Combine Records. A `PythonUDFRecord` instance has the following representations of the Record:

- `record_id` - the Record Identifier of the Record
- `document` - raw XML for the Record (what is passed to XSLT transformations)
- `xml` - raw XML parsed with lxml’s etree, an `ElementTree` instance
- `nsmap` - dictionary of namespaces, useful for working with the `xml` instance
The second argument is named, and must be called, `test_message`. The string value of the `test_message` argument will be used for reporting if that particular test fails; this is the human readable name of the validation test.
All validation tests, recalling that the function name must be prefixed with `test_`, must return `True` or `False` to indicate whether the Record passed the validation test.
An example of an arbitrary Validation Scenario that looks for MODS titles longer than 30 characters might look like the following:
# note the ability to import (just for demonstration, not actually used below)
import re

def test_title_length_30(record, test_message="check for title length > 30"):

    # using PythonUDFRecord's parsed instance of Record with .xml attribute, and namespaces from .nsmap
    titleInfo_elements = record.xml.xpath('//mods:titleInfo', namespaces=record.nsmap)
    if len(titleInfo_elements) > 0:
        title = titleInfo_elements[0].text
        if len(title) > 30:
            # returning False fails the validation test
            return False
        else:
            # returning True, passes
            return True

# note ability to define other functions
def other_function():
    pass

def another_function():
    pass
ElasticSearch DSL query¶
ElasticSearch DSL query type Validation Scenarios are a bit different. Instead of validating a Record’s document directly, ElasticSearch DSL validations work by performing ElasticSearch queries against the mapped fields for a Job, and marking Records as valid or invalid based on whether they match those queries.
These queries may be written such that matching Records are considered valid, or such that matches are considered invalid.
An example structure of an ElasticSearch DSL query might look like the following:
[
  {
    "test_name": "field foo exists",
    "matches": "valid",
    "es_query": {
      "query": {
        "exists": {
          "field": "foo"
        }
      }
    }
  },
  {
    "test_name": "field bar does NOT have value 'baz'",
    "matches": "invalid",
    "es_query": {
      "query": {
        "match": {
          "bar.keyword": "baz"
        }
      }
    }
  }
]
This example contains two tests in a single Validation Scenario: checking for field `foo`, and checking that field `bar` does not have the value `baz`. Each test must contain the following properties:
- `test_name` - name that will be returned in the validation reporting for failures
- `matches` - the string `valid` if matches to the query should be considered valid, or `invalid` if query matches should be considered invalid
- `es_query` - the raw ElasticSearch DSL query
ElasticSearch DSL queries can support complex querying (boolean, and/or, fuzzy, regex, etc.), resulting in an additional, rich and powerful way to validate Records.
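As a sketch of that kind of more complex querying, the two checks from the example above could also be expressed as a single test using an ElasticSearch bool query (the field names `foo` and `bar` remain hypothetical):

[
  {
    "test_name": "field foo exists AND field bar does not have value 'baz'",
    "matches": "valid",
    "es_query": {
      "query": {
        "bool": {
          "must": [
            { "exists": { "field": "foo" } }
          ],
          "must_not": [
            { "match": { "bar.keyword": "baz" } }
          ]
        }
      }
    }
  }
]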
Record Identifier Transformation Scenario¶
Another configurable “Scenario” in Combine is the Record Identifier Transformation Scenario, or “RITS” for short. A RITS allows the transformation of a Record’s “Record Identifier”. A Record has three identifiers in Combine, with the Record Identifier (`record_id`) being the only changeable, mutable one of the three. The Record ID is what is used for publishing and, for all intents and purposes, is the unique identifier for the Record outside of Combine.
Record Identifiers are created during Harvest Jobs, when a Record is first created. This Record Identifier may come from the OAI server from which the Record was harvested, it might be derived from an identifier in the Record’s XML in the case of a static harvest, or it may be minted as a UUID4 on creation. Where the Record ID is picked up from OAI or the Record’s XML itself, it might not need transformation before publishing, and can “go out” just as it “came in.” However, there are instances where transforming the Record’s ID can be quite helpful.
Take the following scenario. A digital object’s metadata is harvested from `Repository A` with the ID `foo`, as part of OAI set `bar`, by `Metadata Aggregator A`. Inside `Metadata Aggregator A`, which has its own OAI server prefix of `baz`, the full identifier of this record is considered to be `baz:bar:foo`. Next, `Metadata Aggregator B` harvests this record from `Metadata Aggregator A`, under the OAI set `scrog`. `Metadata Aggregator B` has its own OAI server prefix of `tronic`. Finally, when a terminal harvester like DPLA harvests this record from `Metadata Aggregator B` under the set `goober`, it might have a motley identifier, constructed through all these OAI “hops”, of something like: `tronic:scrog:goober:baz:bar:foo`.
If one of these hops were replaced by an instance of Combine, one of the OAI “hops” would be removed, and the dynamically crafted identifier for that same record would change. Combine allows the ability to transform the identifier – emulating previous OAI “hops”, completely re-writing, or any other transformation – through a Record Identifier Transformation Scenario (RITS).
RITS are performed, just like Transformation Scenarios or Validation Scenarios, for every Record in the Job. RITS may be in the form of:
- Regular Expressions - specifically, python flavored regex
- Python code snippet - a snippet of code that will transform the identifier
- XPATH expression - given the Record’s raw XML, an XPath expression may be given to extract a value to be used as the Record Identifier
All RITS have the following values:
- `Name` - Human readable name for the RITS
- `Transformation type` - `regex` for Regular Expression, `python` for Python code snippet, or `xpath` for XPath expression
- `Transformation target` - the RITS payload and type may use the pre-existing Record Identifier as input, or the Record’s raw XML record
- `Regex match payload` - if using regex, the regular expression to match
- `Regex replace payload` - if using regex, the regular expression to replace that match with (allows values from groups)
- `Python payload` - python code snippet that will be passed an instance of a PythonUDFRecord
- `Xpath payload` - single XPath expression as a string
Payloads that do not pertain to the chosen `Transformation type` may be left blank (e.g. if using a python code snippet, the regex match/replace payloads and the xpath payload may be left blank).
Similar to Transformation and Validation Scenarios, RITS can be tested by clicking the “Test Record Identifier Transformation Scenario” button at the bottom. You will be presented with a familiar screen: a table of Records, and the ability to select a pre-existing RITS, edit it, and/or create a new one. As with those screens, there is no ability to update or save a new one here, merely to test the results of one.
These different types are outlined in a bit more detail below.
Regular Expression¶
If transforming the Record ID with regex, two “payloads” are required for the RITS scenario: a match expression, and a replace expression. Also of note, these regex match and replace expressions are the python flavor of regex matching, performed with python’s `re.sub()`.
The screenshot below shows an example of a regex match / replace used to replace `digital.library.wayne.edu` with `goober.tronic.org`, also highlighting the ability to use groups:
A contrived example, this shows a regex expression applied to the input Record identifier of `oai:digital.library.wayne.edu:wayne:Livingto1876b22354748`.
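Conceptually, the match and replace payloads behave like the first two arguments to python's `re.sub()`. A minimal sketch of what happens to the identifier above (the payload values shown here are illustrative, not required values):

import re

record_id = 'oai:digital.library.wayne.edu:wayne:Livingto1876b22354748'

# regex match payload and regex replace payload, as configured in the RITS
match_payload = r'digital\.library\.wayne\.edu'
replace_payload = 'goober.tronic.org'

# apply the transformation
new_record_id = re.sub(match_payload, replace_payload, record_id)
# new_record_id is now 'oai:goober.tronic.org:wayne:Livingto1876b22354748'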
Python Code Snippet¶
Python code snippets for RITS operate similarly to Transformation and Validation Scenarios, in that the python code snippet is given an instance of a PythonUDFRecord for each Record. However, it differs slightly in that if the RITS `Transformation target` is the Record ID only, the PythonUDFRecord will have only the `.record_id` attribute to work with.
For a python code snippet RITS, a function named `transform_identifier` is required, with a single unnamed, passed argument of a PythonUDFRecord instance. An example may look like the following:
# ability to import modules as needed (just for demonstration)
import re
import time

# function named `transform_identifier`, with single passed argument of PythonUDFRecord instance
def transform_identifier(record):

    '''
    In this example, a string replacement is performed on the record identifier,
    but this could be much more complex, using a combination of the Record's parsed
    XML and/or the Record Identifier. This example is meant to show the structure of a
    python based RITS only.
    '''

    # function must return string of new Record Identifier
    return record.record_id.replace('digital.library.wayne.edu', 'goober.tronic.org')
And a screenshot of this RITS in action:
XPath Expression¶
Finally, a single XPath expression may be used to extract a new Record Identifier from the Record’s XML record. Note: the input must be the Record’s document, not the current Record Identifier, as the XPath must have valid XML to retrieve a value from. Below is an example screenshot:
Combine OAI-PMH Server¶
Combine comes with a built-in OAI-PMH server to serve published Records. Configurations for the OAI server are not, at this time, managed through the Django admin, but may be found in `combine/localsettings.py`. These settings include:
- `OAI_RESPONSE_SIZE` - how many records to return per OAI paged response
- `COMBINE_OAI_IDENTIFIER` - it is common for OAI servers (producers) to prefix Record identifiers on the way out with an identifier unique to the server. This setting can also be configured to mirror the identifier used in other/previous OAI servers to mimic downstream identifiers
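A minimal sketch of what these settings might look like in `combine/localsettings.py` (the values below are examples only, not defaults):

# example OAI server settings in combine/localsettings.py
OAI_RESPONSE_SIZE = 500
COMBINE_OAI_IDENTIFIER = 'oai:digital.library.example.edu'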
DPLA Bulk Data Downloads (DBDD)¶
One of the more experimental features of Combine is to compare the Records from a Job (or, of course, multiple Jobs if they are Merged into one) against a bulk data download from DPLA.
To use this function, S3 credentials must be added to the `combine/localsettings.py` settings file that allow for downloading of bulk data downloads from S3. Once added, and Combine restarted, it is possible to download previous bulk data dumps. This can be done from the Configuration page by clicking “Download and Index Bulk Data”, then selecting a bulk data download from the long dropdown. When the button is clicked, this data set will be downloaded and indexed locally in ElasticSearch, all as a background task. This will be reflected in the table on the Configuration page as complete when the row reads “Downloaded and Indexed”:
Comparison can be triggered from any Job’s optional parameters under the tab “DPLA Bulk Data Compare”. Comparison is performed by attempting to match a Record’s Record Identifier to the `_id` field in the DPLA Item document.
Because this comparison is using the Record Identifier for matching, this is a great example of where a Record Identifier Transformation Scenario (RITS) can be a powerful tool to emulate or recreate a known or previous identifier pattern. So much so, it’s conceivable that passing a RITS along with the DPLA Bulk Data Compare – just to temporarily transform the Record Identifier for comparison’s sake, but not in the Combine Record itself – might make sense.
Workflows and Viewing Results¶
This section will describe different parts of workflows for running, viewing detailed results, and exporting Jobs and Records.
Sub-sections include:
Record Versioning¶
In an effort to preserve the various stages of a Record through harvest, possibly multiple transformations, merges, and sub-dividing, Combine takes the approach of copying the Record each time.
As outlined in the Data Model, Records are represented in both MongoDB and ElasticSearch. Each time a Job is run, and a Record is duplicated, it gets a new document in Mongo, with the full XML of the Record duplicated. Records are associated with each other across Jobs by their Combine ID.
This approach has pros and cons:
- Pros
  - simple data model, with each version of a Record stored separately
  - each Record stage can be indexed and analyzed separately
  - Jobs containing Records can be deleted without affecting up/downstream Records (they will vanish from the lineage)
- Cons
  - duplication of data is potentially unnecessary if Record information has not changed
Running Jobs¶
Note: For all Jobs in Combine, confirm that an active Livy session is up and running before proceeding.
All Jobs are tied to, and initiated from, a Record Group. From the Record Group page, at the bottom you will find buttons for starting new jobs:
Clicking any of these Job types will initiate a new Job, and present you with the options outlined below.
Optional Parameters¶
When running any type of Job in Combine, you are presented with a section near the bottom for Optional Parameters for the job:
These options are split across various tabs, and include:
For the most part, a user is required to pre-configure these in the Configurations section, and then select which optional parameters to apply during runtime for Jobs.
Record Input Filters¶
When running a new Transform or Duplicate/Merge Job, which both rely on other Jobs as Input Jobs, filters can be applied to filter incoming Records. These filters are settable via the “Record Input Filter” tab.
There are two ways in which filters can be applied:
- “Globally”, where all filters are applied to all Jobs
- “Job Specific”, where a set of filters can be applied to individual Jobs, overriding any “Global” filters
Setting filters for individual Jobs is performed by clicking the filter icon next to a Job’s checklist in the Input Job selection table:
This will bring up a modal window where filters can be set for that Job, and that Job only. When the modal window is saved, and filters applied to that Job, the filter icon will turn orange indicating that Job has unique filters applied:
When filters are applied to specific Jobs, this will be reflected in the Job lineage graph:
and the Input Jobs tab for the Job as well:
Currently, the following input Record filters are supported:
- Filter by Record Validity
- Limit Number of Records
- Filter Duplicates
- Filter by Mapped Fields
Filter by Record Validity
Users can select if all, valid, or invalid Records will be included.
Below is an example of how those valves can be applied and utilized with Merge Jobs to select only valid or invalid records:
Keep in mind, if multiple Validation Scenarios were run for a particular Job, it only requires failing one test, within one Validation Scenario, for the Record to be considered “invalid” as a whole.
Limit Number of Records
Arguably the simplest filter, users can provide a number to limit the total number of Records that will be used as input. This numerical filter is applied after other filters have been applied, and after the Records from each Input Job have been mixed. Given Input Jobs `A`, `B`, and `C`, each with 1,000 Records, and a numerical limit of 50, it’s quite possible that all 50 will come from Job `A`, and 0 from `B` and `C`.
This filter is likely most helpful for testing and sampling.
Filter Duplicates
Optionally, remove duplicate Records based on matching `record_id` values. As these are used for publishing, this can be a way to ensure that Records are not published with duplicate `record_id`s.
Filter by Mapped Fields
Users can provide an ElasticSearch DSL query, as JSON, to refine the records that will be used for this Job.
Take, for example, an input Job of 10,000 Records that has a field `foo_bar`, where 500 of those Records have the value `baz` for this field. If the following query is entered here, only the 500 Records returned by this query will be used for the Job:
{
  "query": {
    "match": {
      "foo_bar": "baz"
    }
  }
}
This ability hints at the potential of taking the time to map fields in interesting and helpful ways, such that those mapped fields can be used to refine later Jobs. ElasticSearch queries can be quite powerful and complex, and in theory this filter will support any query used.
Field Mapping Configuration¶
Combine maps a Record’s original document – likely XML – to key/value pairs suitable for ElasticSearch with a library called `XML2kvp`. When running a new Job, users can provide parameters to the `XML2kvp` parser in the form of JSON.
Here’s an example of the default configurations:
{
  "add_literals": {},
  "concat_values_on_all_fields": false,
  "concat_values_on_fields": {},
  "copy_to": {},
  "copy_to_regex": {},
  "copy_value_to_regex": {},
  "error_on_delims_collision": false,
  "exclude_attributes": [],
  "exclude_elements": [],
  "include_all_attributes": false,
  "include_attributes": [],
  "node_delim": "_",
  "ns_prefix_delim": "|",
  "remove_copied_key": true,
  "remove_copied_value": false,
  "remove_ns_prefix": false,
  "self_describing": false,
  "skip_attribute_ns_declarations": true,
  "skip_repeating_values": true,
  "split_values_on_all_fields": false,
  "split_values_on_fields": {}
}
Clicking the button “What do these configurations mean?” will provide information about each parameter, pulled from the `XML2kvp` JSON schema.
The default is a safe bet to run Jobs, but configurations can be saved, retrieved, updated, and deleted from this screen as well.
Additional, high level discussion about mapping and indexing metadata can also be found here.
Validation Tests¶
One of the most commonly used optional parameters would be what Validation Scenarios to apply for this Job. Validation Scenarios are pre-configured validations that will run for each Record in the Job. When viewing a Job’s or Record’s details, the result of each validation run will be shown.
The Validation Tests selection looks like this for a Job, with a checkbox for each pre-configured Validation Scenario (additionally, checked if the Validation Scenario is marked to run by default):
Transform Identifier¶
When running a Job, users can optionally select a Record Identifier Transformation Scenario (RITS) that will modify the Record Identifier for each Record in the Job.
DPLA Bulk Data Compare¶
One somewhat experimental feature is the ability to compare the Records from a Job against a downloaded and indexed bulk data dump from DPLA. These DPLA bulk data downloads can be managed in Configurations here.
When running a Job, a user may optionally select what bulk data download to compare against:
Viewing Job Details¶
One of the most detail-rich screens is the results and details from a Job run. This section outlines its major areas. This is often referred to as the “Job Details” page.
At the very top of a Job Details page, a user is presented with a “lineage” of input Jobs that relate to this Job:
Also in this area is a button “Job Notes” which will reveal a panel for reading / writing notes for this Job. These notes will also show up in the Record Group’s Jobs table.
Below that are tabs that organize the various parts of the Job Details page:
Records¶
This table shows all Records for this Job. It is sortable and searchable (though limited in which fields are searchable), and contains the following columns:
- `DB ID` - the Record’s ObjectID in MongoDB
- `Combine ID` - identifier assigned to the Record on creation, which sticks with the Record through all stages and Jobs
- `Record ID` - Record identifier that is acquired, or created, on Record creation, and is used for publishing downstream. This may be modified across Jobs, unlike the `Combine ID`.
- `Originating OAI set` - what OAI set this Record was harvested as part of
- `Unique` - True/False, whether the `Record ID` is unique in this Job
- `Document` - link to the Record’s raw XML document, blank if error
- `Error` - explanation for error, if any, otherwise blank
- `Validation Results` - True/False, whether the Record passed all Validation Tests; True if none were run for this Job
In many ways, this is the most direct and primary route to access Records from a Job.
Mapped Fields¶
This tab provides a table of all indexed fields for this job, the nature of which is covered in more detail here:
Re-Run¶
Jobs can be re-run “in place” such that all current parameters, applied scenarios, and linkages to other jobs are maintained. All “downstream” Jobs – Jobs that inherit Records from this Job – are also automatically re-run.
One way to think about re-running Jobs is to think of a group of Jobs that inherit Records from one another as a “pipeline”.
Jobs may also be re-run, as well as in bulk with other Jobs, from a Record Group page.
More information can be found here: Re-Running Jobs documentation.
Publish¶
This tab provides the means of publishing a single Job and its Records. This is covered in more detail in the Publishing section.
Input Jobs¶
This table shows all Jobs that were used as input Jobs for this Job.
Validation¶
This tab shows the results of all Validation tests run for this Job:
For each Validation Scenario run, the table shows the name, type, count of records that failed, and a link to see the failures in more detail.
More information about Validation Results can be found here.
DPLA Bulk Data Matches¶
If a DPLA bulk data download was selected to compare against for this Job, the results will be shown in this tab.
The following screenshot gives a sense of what this looks like for a Job containing about 250k records, that was compared against a DPLA bulk data download of comparable size:
This feature is still somewhat exploratory, but Combine provides an ideal environment and “moment in time” within the greater metadata aggregation ecosystem for this kind of comparison.
In this example, we are seeing that 185k Records were found in the DPLA data dump, and that 38k Records appear to be new. Without an example at hand, it is difficult to show, but it’s conceivable that by leaving Jobs in Combine, and then comparing against a later DPLA data dump, one would have the ability to confirm that all records do indeed show up in the DPLA data.
Spark Details¶
This tab provides helpful diagnostic information about the Job as run in the background in Spark.
Spark Jobs/Tasks Run
Shows the actual tasks and stages as run by Spark. Due to how Spark runs, the names of these tasks may not be familiar or immediately obvious, but they provide a window into the Job as it runs. This section also shows additional tasks that have been run for this Job, such as re-indexing or new validations.
Livy Statement Information
This section shows the raw JSON output from the Job as submitted to Apache Livy.
Job Type Details - Jobs¶
For each Job type – `Harvest`, `Transform`, `Merge/Duplicate`, and `Analysis` – the Job Details screen provides a tab with information specific to that Job type.
All Jobs contain a section called Job Runtime Details that show all parameters used for the Job:
OAI Harvest Jobs
Shows what OAI endpoint was used for Harvest.
Static Harvest Jobs
No additional information at this time for Static Harvest Jobs.
Transform Jobs
The “Transform Details” tab shows Records that were transformed during the Job in some way. For some Transformation Scenarios, it might be assumed that all Records will be transformed, but others, may only target a few Records. This allows for viewing what Records were altered.
Clicking into a Record, and then clicking the “Transform Details” tab at the Record level, will show detailed changes for that Record (see below for more information).
Merge/Duplicate Jobs
No additional information at this time for Merge/Duplicate Jobs.
Analysis Jobs
No additional information at this time for Analysis Jobs.
Export¶
Records from Jobs may be exported in a variety of ways, read more about exporting here.
Record Documents
Exporting a Job as Documents takes the stored XML documents for each Record, distributes them across a user-defined number of files, exports as XML documents, and compiles them in an archive for easy downloading.
For example, 1,000 Records where a user selects 250 per file, for Job `#42`, would result in the following structure:

- archive.zip|tar
  - j42/ # folder for Job
    - part00000.xml # each XML file contains 250 records grouped under a root XML element <documents>
    - part00001.xml
    - part00002.xml
    - part00003.xml
The following screenshot shows the actual result of a Job with 1,070 Records, exporting 50 per file, with a zip file and the resulting, unzipped structure:
Why export like this? Very large XML files can be problematic to work with, particularly for XML parsers that attempt to load the entire document into memory (which is most of them). Combine is naturally pre-disposed to think in terms of parts and partitions with the Spark back-end, which makes for convenient writing of all Records from a Job in smaller chunks. The size of each “chunk” can be set by specifying the `XML Records per file` input in the export form. Finally, both .zip and .tar files are supported for the resulting export.
When a Job is exported as Documents, this will send users to the Background Tasks screen where the task can be monitored and viewed.
Viewing Record Details¶
At the most granular level of Combine’s data model is the Record. This section will outline the various areas of the Record details page.
The table at the top of a Record details page provides identifier information:
Similar to a Job details page, the following tabs breakdown other major sections of this Record details.
Record XML¶
This tab provides a glimpse at the raw, XML for a Record:
Note also two buttons for this tab:
- `View Document in New Tab` - this will show the raw XML in a new browser tab
- `Search for Matching Documents` - this will search all Records in Combine for other Records with an identical XML document
Indexed Fields¶
This tab provides a table of all indexed fields in ElasticSearch for this Record:
Notice in this table the columns `DPLA Mapped Field` and `Map DPLA Field`. Both of these columns pertain to a functionality in Combine that attempts to “link” a Record with the same record in the live DPLA site. It performs this action by querying the DPLA API (DPLA API credentials must be set in `localsettings.py`) based on mapped indexed fields. Though this area has potential for expansion, currently the most reliable and effective DPLA field to try and map is the `isShownAt` field.
The `isShownAt` field is the URL that all DPLA items require to send visitors back to the originating organization or institution’s page for that item. As such, it is also unique to each Record, and provides a handy way to “link” Records in Combine to items in DPLA. The difficult part is often figuring out which indexed field in Combine contains the URL.
Note: When this is applied to a single Record, that mapping is then applied to the Job as a whole. Viewing another Record from this Job will reflect the same mappings. These mappings can also be applied at the Job or Record level.
In the example above, the indexed field `mods_location_url_@usage_primary` has been mapped to the DPLA field `isShownAt`, which provides a reliable linkage at the Record level.
Record Stages¶
This table shows the various “stages” of a Record, which is effectively what Jobs the Record also exists in:
Records are connected by their Combine ID (`combine_id`). From this table, it is possible to jump to other, earlier “upstream” or later “downstream” versions of the same Record.
Record Validation¶
This tab shows all Validation Tests that were run for this Job, and how this Record fared:
More information about Validation Results can be found here.
DPLA Link¶
When a mapping has been made to the DPLA `isShownAt` field from the Indexed Fields tab (or at the Job level), and if a DPLA API query is successful, a result will be shown here:
Results from the DPLA API are parsed and presented here, with the full API JSON response at the bottom (not pictured here). This can be useful for:
- confirming existence of a Record in DPLA
- easily retrieving detailed DPLA API metadata about the item
- confirming that changes and transformations are propagating as expected
Job Type Details - Records¶
For each Job type – `Harvest`, `Transform`, `Merge/Duplicate`, and `Analysis` – the Record details screen provides a tab with information specific to that Job type.
Harvest Jobs
No additional information at this time for Harvest Jobs.
Transform Jobs
This tab will show Transformation details specific to this Record.
The first section shows the Transformation Scenario used, including the transformation itself, and the “input” or “upstream” Record that was used for the transformation:
Clicking the “Re-run Transformation on Input Record” button will send you to the Transformation Scenario preview page, with the Transformation Scenario and Input Record automatically selected.
Further down is a detailed diff between the input and output documents for this Record. In this minimal example, you can observe that `Juvenile` was changed to `Youth` in the Transformation, resulting in only a couple of isolated changes:
For transformations where the Record is largely re-written, the changes will be lengthier and more complex:
Users may also click the button “View Side-by-Side Changes” for a GitHub-esque, side-by-side diff of the Input Record and the Current Record (made possible by the sxsdiff library):
Merge/Duplicate Jobs
No additional information at this time for Merge/Duplicate Jobs.
Analysis Jobs
No additional information at this time for Analysis Jobs.
Managing Jobs¶
Once you work through initiating the Job, configuring the optional parameters outlined above, and running it, you will be returned to the Record Group screen and presented with the following Job lineage “graph” and a table showing all Jobs for the Record Group:
The graph at the top shows all Jobs for this Record Group, and their relationships to one another. The edges between nodes show how many Records were used as input for the target Job, and what filters, if any, were applied. This graph is zoomable and clickable, and is designed to provide some insight and context at a glance, while the table below is designed to be more functional.
The table shows all Jobs, with optional filters and a search box in the upper-right. The columns include:
- `Job ID` - numerical Job ID in Combine
- `Timestamp` - when the Job was started
- `Name` - clickable name for the Job that leads to Job details; optionally given by the user, or a default is generated. This is editable anytime.
- `Organization` - clickable link to the Organization this Job falls under
- `Record Group` - clickable link to the Record Group this Job falls under (as this table is reused throughout Combine, it can sometimes contain Jobs from other Record Groups)
- `Job Type` - Harvest, Transform, Merge, or Analysis
- `Livy Status` - the status of the Job in Livy
  - `gone` - Livy has been restarted or stopped, and no information about this Job is available
  - `available` - Livy reports the Job as complete and available
  - `waiting` - the Job is queued behind others in Livy
  - `running` - the Job is currently running in Livy
- `Finished` - though Livy does the majority of the Job processing, this indicates the Job is finished in the context of Combine
- `Is Valid` - True/False; True if no validations were run or all Records passed validation, False if any Records failed any validations
- `Publishing` - buttons for Publishing or Unpublishing a Job
- `Elapsed` - how long the Job has been running, or took
- `Input` - all input Jobs used for this Job
- `Notes` - optional notes field that can be filled out by the user here, or in Job Details
- `Total Record Count` - total number of successfully processed Records
- `Actions` - buttons for Job details, or monitoring the status of the Job in Spark (see the Spark and Livy documentation for more information)
This graph and table represents Jobs already run, or running. This is also where Jobs can be moved, stopped, deleted, rerun, even cloned. This is performed by using the bank of buttons under “Job Management”:
All management options contain a slider titled “Include Downstream” that defaults to on or off depending on the task. When on for a particular task, this will analyze the lineage of all selected Jobs, determine which are downstream, and include them in the action being performed (e.g. moving, deleting, re-running, etc.).
The idea of “downstream” Jobs, and some of the actions like Re-Running and Cloning introduce another dimension to Jobs and Records in Combine, that of Pipelines.
Pipelines¶
What is meant by “downstream” Jobs? Take the interconnected Jobs below:
In this example, the OAI Harvest Job `A` is the “root” or “origin” Job of this lineage. This is where Records were first harvested and created in Combine (this might also be a static harvest, or other forms of importing Records yet to come). All other Jobs in the lineage – `B`, `C`, `D`, and `E` – are considered “downstream”. From the point of view of `A`, there is a single pipeline. If a user were to reharvest `A`, potentially adding, removing, or modifying Records in that Job, this has implications for all other Jobs that either got Records from `A`, or got Records from Jobs that got Records from `A`, and so forth. In that sense, Jobs are “downstream” if changes to an “upstream” Job would potentially change their own Records.
Moving to `B`, only one Job is downstream: `D`. Looking at `C`, there are two downstream Jobs: `D` and `E`. Looking again at the Record Group lineage, we can see that `D` has two upstream Jobs, `B` and `C`. This can be confirmed by looking at the “Input Jobs” tab for `D`:
Why are there zero Records coming from `C` as an Input Job? Looking more closely at this contrived example, and the input filters applied to Jobs `B` and `C`, we see that “De-Dupe Records” is true for both. We can infer that Jobs `B` and `C` provided Records with the same `record_id`, and as a result, those coming from `C` were all de-duped – skipped – during the Merge.
Another view of the lineage for `D`, from its perspective, can be seen at the top of the Job Details page for `D`, confirming all this:
Getting back to the idea of pipelines and Job management, what would happen if we select `A` and click the “Re-Run Selected Jobs” button with “Include Downstream” turned on? Jobs `A` through `E` would be slated for re-running, queuing in order to ensure that each Job gets updated Records from each upstream Job:
We can see that the status changed for each Job (potentially after a page refresh), and the Jobs will re-run in order.
We also have the ability to clone Jobs, including or ignoring downstream Jobs. The following is an example of cloning `C`, not including downstream Jobs:
Under the hood, all validations, input filters, and parameters that were set for `C` are copied to the new Job `C (CLONED)`, but because downstream Jobs were not included, `D` and `E` were not cloned. But if we were to select downstream Jobs from `C` when cloning, we’d see something that looks like this:
Woah there! Why the line from `B` to the newly created cloned Job `D (CLONE)`? `D` was downstream from `C` during the clone, so it was cloned as well, but it still required input from `B`, which was not cloned. We can imagine that `B` might be a group of Records that rarely change but are required in our pursuits, and so that connection is persisted.
As one final example of cloning, to get a sense of Input Jobs for Jobs that are cloned versus those that are not, we can look at the example of cloning `A`, including all its downstream Jobs:
Because `A` has every Job in this view as downstream, cloning `A` essentially clones the entire “pipeline” and creates a standalone copy. This could be useful for cloning a pipeline to test re-running the entire thing, where it is not desirable to risk the integrity of the pipeline before knowing if it will be successful.
Finally, we can see that “Include Downstream” applies to other tasks as well, e.g. deleting, where we have selected to delete `A (CLONE)` and all downstream Jobs:
“Pipelines” are not a formal structure in Combine, but can be a particularly helpful way to think about a “family” or “lineage” of connected Jobs. The ability to re-run and clone Jobs came later in the data model, but with the addition of granular control of input filters for Input Jobs, can prove to be extremely helpful for setting up complicated pipelines of interconnected Jobs that can be reused.
Harvesting Records¶
Harvesting is how Records are first introduced to Combine. Like all Jobs, Harvest Jobs are run from the Record Group overview page.
The following will outline specifics for running Harvest Jobs, with more general information about running Jobs here.
OAI-PMH Harvesting¶
OAI-PMH harvesting in Combine utilizes the Apache Spark OAI harvester from DPLA’s Ingestion 3 engine.
Before running an OAI harvest, you must first configure an OAI Endpoint in Combine that will be used for harvesting from. This only needs to be done once, and can then be reused for future harvests.
From the Record Group page, click the “Harvest OAI-PMH” button at the bottom.
Like all Jobs, you may optionally give the Job a name or add notes.
Below that, indicated by a green alert, are the required parameters for an OAI Job. First, is to select your pre-configured OAI endpoint. In the screenshot below, an example OAI endpoint has been selected:
Default values for harvesting are automatically populated from your configured endpoint, but can be overridden at this time, for this harvest only. Changes are not saved for future harvests.
Once configurations are set, click “Run Job” at the bottom to harvest.
Identifiers for OAI-PMH harvesting¶
As a Harvest type Job, OAI harvests are responsible for creating a Record Identifier (record_id) for each Record. The record_id is pulled from the record/header/identifier field for each Record harvested.
As you continue on your metadata harvesting, transforming, and publishing journey, and you are thinking about how identifiers came to be, or might be changed, this is a good place to start from to see what the originating identifier was.
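To make that concrete, the sketch below (illustrative only, not Combine's harvester code; the record content is hypothetical) shows where that identifier lives in a raw OAI-PMH record:
# Illustrative only: where record_id comes from in an OAI-PMH <record> block.
from lxml import etree

oai_record = '''<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:record/123</identifier>
    <datestamp>2018-09-20T00:00:00Z</datestamp>
  </header>
  <metadata><!-- metadata payload --></metadata>
</record>'''

root = etree.fromstring(oai_record)
ns = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

# record/header/identifier is what becomes the Record's record_id
print(root.find('oai:header/oai:identifier', namespaces=ns).text)
# oai:example.org:record/123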
Static File Harvest¶
It is also possible to harvest Records from static sources, e.g. XML uploads. Combine uses Databricks Spark-XML to parse XML records from uploaded content. This utilizes the powerful globbing capabilities of Hadoop for locating XML files. Users may also provide a location on disk as opposed to uploading a file, but this is probably less commonly used, and the documentation will focus on uploads.
Using the Spark-XML library provides an efficient and powerful way of locating and parsing XML records, but it does so in a way that might be unfamiliar at first. Instead of providing XPath expressions for locating Records, only the XML Record’s root element is required, and the Records are located as raw strings.
For example, a MODS record that looks like the following:
<mods:mods>
<mods:titleInfo>
<mods:title>Amazing Record of Incalculable Worth</mods:title>
</mods:titleInfo>
...
...
</mods:mods>
Would need only the following root XML element string to be found: mods:mods. No angle brackets, no XPath expressions, just the element name!
However, a close inspection reveals this MODS example record does not have the required namespace declaration, xmlns:mods="http://www.loc.gov/mods/v3". It’s possible this was declared in a different part of the XML Record. Because Spark-XML locates XML records more as strings, as opposed to parsed documents, Combine also allows users to include an XML root element declaration that will be used for each Record found. For this example, the following could be provided:
xmlns:mods="http://www.loc.gov/mods/v3"
Which would result in the following, final, valid XML Record in Combine:
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
<mods:titleInfo>
<mods:title>Amazing Record of Incalculable Worth</mods:title>
</mods:titleInfo>
...
...
</mods:mods>
Once a file has been selected for uploading, and these required parameters are set, click “Run Job” at the bottom to harvest.
Is this altering the XML records that I am providing Combine?¶
The short answer is, yes. But it’s important to remember that XML files are often altered in some way when parsed and re-serialized. Their integrity is not character-by-character similarity, but what data can be parsed. This approach only alters the declarations in the root XML element.
Uploads to Combine that already include namespaces, and all required declarations, at the level of each individual Record, do not require this re-writing and will leave the XML untouched.
What kind of files and/or structures can be uploaded?¶
Quite a few! Static harvests will scour what is uploaded – through a single XML file, across multiple files within a zipped or tarred archive file, even recursively through directories if they are present in an archive file – for the root XML element, e.g. mods:mods, parsing each one it encounters.
Examples include:
- METS file with metadata in <dmdSec> sections
- zip file of directories, each containing multiple XML files (see the packaging sketch just below)
- single MODS XML file, that contains multiple MODS records
- though not encouraged, even a .txt file with XML strings contained therein!
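As a small aside, packaging a directory of XML files for upload can be done with ordinary tools; for example, a minimal sketch using Python's standard library (the directory and archive names are hypothetical):
# Illustrative: package a local directory of XML files as a zip for upload.
import shutil

# creates mods_records.zip from the contents of ./mods_records/
shutil.make_archive('mods_records', 'zip', 'mods_records')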
Identifiers for Static harvesting¶
For static harvests, identifiers can be created in one of two ways (both sketched below):
- by providing an XPath expression to retrieve a string from the parsed XML record
- by assigning a UUID based on a hash of the XML record as a string
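A minimal sketch of what these two approaches amount to follows (this is not Combine's internal code; the XPath expression and record content are hypothetical, and whether the hash is then formatted as a UUID is an implementation detail):
# Illustrative sketch of the two identifier strategies for static harvests.
import hashlib
from lxml import etree

record_string = ('<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
                 '<mods:identifier>local:0042</mods:identifier></mods:mods>')
doc = etree.fromstring(record_string)

# 1) XPath-derived identifier: pull a string out of the parsed record
xpath_id = doc.xpath('string(//mods:identifier)',
                     namespaces={'mods': 'http://www.loc.gov/mods/v3'})
print(xpath_id)  # local:0042

# 2) Hash-derived identifier: hash the record's raw string form
print(hashlib.md5(record_string.encode('utf-8')).hexdigest())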
Transforming Records¶
Transformation Jobs are how Records are transformed in some way in Combine; all other Jobs merely copy and/or analyze Records, but Transformation Jobs actually alter the Record’s XML that is stored in MySQL.
The following will outline specifics for running Transformation Jobs, with more general information about running Jobs here.
Similar to Harvest Jobs, you must first configure a Transformation Scenario that will be selected and used when running a Transformation Job.
The first step is to select a single input Job to supply the Records for transformation:
Next, will be selecting your pre-configured Transformation Scenario:
As most of the configuration is done in the Transformation Scenario, there is very little to do here! Select optional parameters and click “Run Job” at the bottom.
Merging Records¶
The following will outline specifics for running Merge / Duplicate Jobs, with more general information about running Jobs here.
“Merge / Duplicate” Jobs are precisely what they sound like: they are used to copy, merge, and/or duplicate Records in Combine. They might also be referred to as simply “Merge” Jobs throughout.
To run a Merge Job, select “Duplicate / Merge Job” from the Record Group screen.
From the Merge Job page, you may notice there are no required parameters! However, unlike Transformation Jobs where input Jobs could only be selected from the same Record Group, the input Job selection screen will show Jobs from across all Organizations and Record Groups:
There are not many Jobs in this instance, but this screen could get intimidating if lots of Jobs were present in Combine. Both the Job “lineage graph” – the visual graph near the top – and the table of Jobs can be useful for limiting the selection.
Clicking on a single Job in the lineage graph will filter the table to include only that Job and all Jobs that were input for it, graying out the other Jobs in the lineage graph to represent the same. Clicking outside of a Job will clear the filter.
Additionally, filters from the Jobs table can be used to limit by Organization, Record Group, Job Type, Status, or even keyword searching. When filters are applied, Jobs will be grayed out in the lineage graph.
Also of note, you can select multiple Jobs for Merge / Duplicate Jobs. When Jobs are merged, a duplicate check is run for the Record Identifiers only.
Select desired Jobs to merge or duplicate – which can be a single Job – and click “Run Job”.
The following screenshot shows the results of a Merge Job with two input Jobs from the Record Group screen:
Why Merge or Duplicate Records?¶
With the flexibility of the data model, Organization --> Record Group --> Job --> Record, comes some complexity in execution.
Merge Jobs have a variety of possible use cases:
- duplicate a Job solely for analysis purposes
- with a single Record Group, optionally perform multiple, small harvests, but eventually merge them in preparation for publishing
- Merge Jobs are actually what run behind the scenes for Analysis Jobs
- Merge Jobs are the only Job type that can pull Jobs from across Organizations or Record Groups
- shunt a subset of valid or invalid Records from a Job for more precise transformations or analysis
As mentioned above, one possible use of Merge / Duplicate Jobs would be to utilize the “Record Input Validity Valve” option to shunt valid or invalid Records into a new Job. In this possible scenario, you could:
- from Job A, select only invalid Records to create Job B
- assuming Job B fixed those validation problems, merge valid Records from Job A with now valid Records from Job B to create Job C
This can be helpful if Job A is quite large but only has a few Records that need further transformation, or if the Transformation that would fix the invalid Records would break – invalidate – other perfectly good Records from Job A. Here is a visual sense of this possible workflow; notice the record counts for each edge:
Publishing Records¶
The following will outline specifics for Publishing a Record Group, with more general information about running Jobs here.
How does Publishing work in Combine?¶
As a tool for aggregating metadata, Combine must also have the ability to serve or distribute aggregated Records again. This is done by “publishing” in Combine, which happens at the Job level.
When a Job is published, a user may assign a Publish Set Identifier (publish_set_id) that is used to aggregate and group published Records. For example, in the built-in OAI-PMH server, that Publish Set Identifier becomes the OAI set ID, or for exported flat XML files, the publish_set_id is used to create a folder hierarchy. Multiple Jobs can publish under the same Publish Set ID, allowing for grouping of materials when publishing.
On the back-end, publishing a Job adds a flag to the Job that indicates it is published, with an optional publish_set_id. Unpublishing removes these flags, but maintains the Job and its Records.
Currently, the following methods are available for publishing Records from Combine:
Publishing a Job¶
Publishing a Job can be initiated one of two ways: from the Record Group’s list of Jobs, which contains a column called “Publishing”:
Or from the “Publish” tab on a Job’s details page. Both point a user to the same screen, which shows the current publish status for a Job.
If a Job is unpublished, a user is presented with a field to assign a Publish Set ID and publish a Job:
If a Job is already published, a user is presented with information about the publish status, and the ability to unpublish:
Both publishing and unpublishing will run a background task.
Note: When selecting a Publish Set ID, consider that when the Records are later harvested from Combine, this Publish Set ID – at that point, an OAI set ID – will prefix the Record Identifier to create the OAI identifier. This behavior is consistent with other OAI-PMH aggregators / servers like REPOX. It is good to consider what OAI sets these Records have been published under in the past (thereby affecting their identifiers), and special characters should probably be avoided.
The handling of identifiers during metadata aggregation is a complex issue, and will not be addressed here, but it’s important to note that the Publish Set ID set while Publishing Records in Combine will have bearing on those considerations.
Viewing Publishing Records¶
All published Records can be viewed from the “Published” section in Combine, which can be navigated to from a consistent link at the top of the page.
The “Published Sets” section in the upper-left shows all published Jobs:
As can be seen here, two Jobs are published, both from the same Record Group, but with different Publish Set IDs.
To the right is an area called “Analysis” that allows for running an Analysis Job over all published Records. While this would be possible from a manually started Analysis Job, by carefully selecting all published Jobs throughout Combine, this is a convenience option to begin an Analysis Job with all published Records as input.
Below these two sections is a table of all published Records. Similar to tables of Records from a Job, this table also contains some unique columns specific to Published Records:
- Outgoing OAI Set - the OAI set, aka the Publish Set ID, that the Record belongs to
- Harvested OAI Set - the OAI set that the Record was harvested under (empty if not harvested via OAI-PMH)
- Unique Record ID - whether or not the Record ID (record_id) is unique among all Published Records
Next, there is a now hopefully familiar breakdown of mapped fields, but this time, for all published Records.
While helpful in the Job setting, this breakdown can be particularly helpful for analyzing the distribution of metadata across Records that are slated for Publishing.
For example: determining if all records have an access URL. Once the mapped field has been identified as where this information should be – in this case mods_location_url_@usage=primary – we can search for this field and confirm that 100% of Records have a value for that mapped field.
More on this in Analyzing Indexed Fields Breakdown.
OAI-PMH Server¶
Combine comes with a built-in OAI-PMH server that serves records directly from the MySQL database via the OAI-PMH protocol. This can be found under the “Outgoing OAI-PMH Server” tab:
Export Flat Files¶
Another way to “publish” or distribute Records from Combine is by exporting flat files of Record XML documents as an archive file. This can be done by clicking the “Export” tab and then “Export Documents”. Read more about exporting here.
Publish Set IDs will be used to organize the exported XML files in the resulting archive file. For example, if a single Job was published under the Publish ID foo, and two Jobs were published under the Publish ID bar, and the user specified 100 Records per file, the resulting export structure would look similar to this:
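A hypothetical layout is sketched below (folder and file names, and record counts, are illustrative only; the point is that each Publish Set ID becomes a folder, with Records split across files of at most 100 Records each):
export_archive/
    foo/
        part00000.xml
    bar/
        part00000.xml
        part00001.xml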
Analysis¶
In addition to supporting the actual harvesting, transformation, and publishing of metadata for aggregation purposes, Combine strives to also support the analysis of groups of Records. Analysis may include looking at the use of metadata fields across Records, or viewing the results of Validation tests performed across Records.
This section will describe some areas of Combine related to analysis. This includes Analysis Jobs proper, a particular kind of Job in Combine, and analysis more broadly when looking at the results of Jobs and their Records.
Analysis Jobs¶
Analysis Jobs are a bit of an island. On the back-end, they are essentially Duplicate / Merge Jobs, and have the same input and configuration requirements. They can pull input Jobs from across Organizations and Record Groups.
Analysis Jobs differ in that they do not exist within a Record Group. They are imagined to be ephemeral, disposable Jobs used entirely for analysis purposes.
You can see previously run, or start a new Analysis Job, from the “Analysis” link from the top-most navigation.
Below is an example of an Analysis Job comparing two Jobs from different Record Groups. This ability to pull Jobs from different Record Groups is shared with Merge Jobs. You can see only one Job in the table, but the lineage graph shows all the Jobs that contribute to this Analysis Job. When the Analysis Job is deleted, none of the other Jobs will be touched (and currently, they are not aware of the Analysis Job in their own lineage).
Analyzing Indexed Fields¶
Undoubtedly one of Combine’s more interesting, confusing, and potentially powerful areas is the indexing of Records’ XML into ElasticSearch. This section will outline how that happens, and some possible insights that can be gleaned from the results.
How and Why?¶
All Records in Combine store their raw metadata as XML in MySQL. Alongside that raw metadata are some other fields about validity, internal identifiers, etc., as they relate to the Record. But, because the metadata is still an opaque XML “blob” at this point, it does not allow for inspection or analysis. To this end, when all Jobs are run, all Records are also indexed in ElasticSearch.
As many who have worked with complex metadata can attest to, flattening or mapping hierarchical metadata to a flat document store like ElasticSearch or Solr is difficult. Combine approaches this problem by generically flattening all elements in a Record’s XML document into XPath paths, which are converted into field names that are stored in ElasticSearch. This includes attributes as well, further dynamically defining the ElasticSearch field name.
For example, the following XML metadata element:
<mods:accessCondition type="useAndReproduction">This book is in the public domain.</mods:accessCondition>
would become the following ElasticSearch field name:
mods_accessCondition_@type_useAndReproduction
While mods_accessCondition_@type_useAndReproduction is not terribly pleasant to look at, it tells us where this value came from inside the XML document. And most importantly, this generic XPath flattening approach can be applied across all XML documents that Combine might encounter.
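The following is a minimal sketch of that general idea (it is not Combine's actual GenericMapper code, and it glosses over details such as repeated sibling elements and multivalued fields), showing how an element's ancestry and attributes can be flattened into the field name shown above:
# Minimal illustration of XPath-style flattening -- not Combine's GenericMapper.
# Walks from the root to a target element, joining local element names and
# attribute key/value pairs into a single flat field name.
from lxml import etree

xml = '''<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:accessCondition type="useAndReproduction">This book is in the public domain.</mods:accessCondition>
</mods:mods>'''

root = etree.fromstring(xml)

def flatten(elem):
    """Build a flat field name from an element's ancestry and attributes."""
    # collect local (namespace-stripped) tag names from root down to elem
    names = [etree.QName(e).localname for e in elem.iterancestors()][::-1]
    names.append(etree.QName(elem).localname)
    # append attributes as @key_value segments
    for key, value in elem.attrib.items():
        names.append('@{}_{}'.format(key, value))
    return '_'.join(names)

target = root.find('{http://www.loc.gov/mods/v3}accessCondition')
print(flatten(target))   # mods_accessCondition_@type_useAndReproduction
print(target.text)       # This book is in the public domain.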
When running Jobs, users can select which “Index Mapper” to use. In addition to the Generic XPath based mapper, which is outlined above, Combine also ships with another mapper called Custom MODS mapper. This is mentioned to point out that other, custom mappers could be created and used if desired.
The Custom MODS mapper is based on an old XSLT flattening map from MODS to Solr that early versions of Islandora used. This mapper results in far fewer indexed fields, which has pros and cons. If the mapping is known and tightly controlled, this could be helpful for precise analysis of where information is going. But the generic mapper will – in some way – map all values from the XML record to ElasticSearch for analysis, albeit with unsightly field names. Choices, choices!
Creating a custom mapper would require writing a new class in the file core/spark/es.py, matching the functionality of a pre-existing mapper like class GenericMapper(BaseMapper).
Breakdown of indexed fields for a Job¶
When viewing the details of a Job, the tab “Field Analysis” shows a breakdown of all fields, for all documents in ElasticSearch from this Job, in a table. These are essentially facets.
There is a button “Show field analysis explanation” that outlines what the various columns mean:
All columns are sortable, and some are linked out to another view that drills further into that particular field. One way to drill down into a field is to click on the field name itself. This will present another view with values from that field. Below is the result of doing that for the field mods_subject_topic:
At the top, you can see some high-level metrics that recreate numbers from the overview, such as:
- how many documents have this field
- how many do not
- how many total values are there, remembering that a single document can have multiple values
- how many distinct values are there
- percentage of unique (distinct / total values)
- and percentage of all documents that have this field
In the table, you can see actual values for the field, with counts across documents in this Job. In the last column, you can click to see Records that have or do not have this particular value for this particular field.
Clicking into a subject like “fairy tales”, we get the following screen:
At this level, we have the option to click into individual Records.
Validation Tests Results¶
Results for Validation Tests run on a particular Job are communicated in the following ways:
- in the Records Table from a Job’s details page
- a quick overview of all tests performed, and number passed, from a Job’s details page
- exported as an Excel or .csv from a Job’s details page
- results for each Validation test on a Record’s details page
When a Record fails any test from any Validation Scenario applied to its parent Job, it is considered “invalid”. When selecting an input Job for another Job, users have the option of selecting all Records, those that passed all validation tests, or those that failed one or more.
The following is a screenshot from a Job Details page, showing that one Validation Scenario was run, and 761 Records failed validation:
Clicking into “See Failures” brings up the resulting screen:
The column Validation Results Payload contains the message from the Validation Test (results may be generated from Schematron or Python, and there may be multiple results), and the Failure Count column shows how many specific tests were failed for that Record (a single Validation Scenario may contain multiple individual tests).
Clicking into a single Record from this table will reveal the Record details page, which has its own area dedicated to what Validation Tests it may have failed:
From this screen, it is possible to run the Validation and receive the raw results via the “Run Validation” link:
Or, a user can send this single Record to the Validation testing area to re-run Validation Scenarios, or test new ones, by clicking the “Test Validation Scenario on this Record” button. From this page, it is possible to select pre-existing Validation Scenarios to apply to this Record in real-time; users can then edit those to test, or try completely new ones (see Validation Scenarios for more on testing):
Re-Running Jobs¶
Overview¶
Jobs in Combine may be “re-run” in a way that makes a series of interlinked Jobs resemble a “pipeline”. This functionality can be particularly helpful when a series of harvests, merges, validations, transformations, and other checks will be repeated in the future, but with new and/or refreshed Records.
When a Job is re-run, the following actions are performed in preparation:
- all Records for that Job are dropped from the DB
- all mapped fields in ElasticSearch are dropped (the ElasticSearch index)
- all validations, DPLA bulk data tests, and other information that is based on the Records are removed
However, what remains is important:
- the Job ID, name, notes
- all configurations that were used
- field mapping
- validations applied
- input filters
- etc.
- linkages to other Jobs; what Jobs were used as input, and what Jobs used this Job as input
- publish status of a Job, with corresponding publish_set_id (if a Job is published before re-running, the updated Records will automatically publish as well)
Examples¶
Consider the example below, with five Jobs in Combine:
In this figure, Records from an OAI Harvest J1 are used as input for J2. A subset of these are passed to J3, perhaps having failed some kind of validation, and are then fixed in J4. J5 is a final merge of the valid Records from J2 and J4, resulting in a final form of the Records. At each hop, there may be various validations and mappings to support the validation and movement of Records.
Now, let’s assume the entire workflow is needed again, but we know that J1 needs to re-harvest Records because there are new or altered Records. Without re-running Jobs in Combine, it would be necessary to recreate each hop in this pipeline, thereby also duplicating the number of Records. Duplication of Records may be beneficial in some cases, but not all. In this example, a user would only need to re-run Job J1, which would trigger all “downstream” Jobs, all the way to Job J5.
Let’s look at a more realistic example, with actual Combine Jobs. Take the following:
In this “pipeline”:
- two OAI harvests are performed
- all invalid Records are sent to a Transform that fixes validation problems
- all valid Records from that Transform Job, and the original Harvests, are merged together in a final Job
The details of these hops are hidden from this image, but there are validations, field mappings, and other configurations at play here. If a re-harvest is needed for one, or both, of the OAI harvests, a re-run of those Jobs will trigger all “downstream” Jobs, refreshing the Records along the way.
If we were to re-run the two Harvest Jobs, we are immediately kicked back to the Record Group screen, where it can be observed that all Jobs have 0 Records, and are currently running or queued to run:
Exporting¶
Exporting Records¶
Records can be exported in two ways: a series of XML files aggregating the XML document for each Record, or the Mapped Fields for each Record as structured data. Records from a Job, or all Published Records, may be exported. Both are found under the “Export” tab in their respective screens.
Export XML Documents¶
Exporting documents will export the XML document for all Records in a Job, or for all published Records, distributed across a series of XML files with an optional number of Records per file and a root element <root> to contain them. This is for ease of working with outside of Combine, where a single XML document containing 50k, 500k, or 1m Records would be cumbersome to work with. The default is 500 Records per file.
You may enter how many records per file, and what kind of compression to use (if any) on the output archive file.
Export Mapped Fields¶
Mapped fields from Records may also be exported, in one of two ways:
- Line-delimited JSON documents (recommended)
- Comma-separated, tabular .csv file
Both default to exporting all fields, but the export may be limited to specific fields by clicking “Select Mapped Fields for Export”.
Both styles may be exported with an optional compression for output.
JSON Documents¶
This is the preferred way to export mapped fields, as it handles characters for field values that may disrupt column delimiters and/or newlines.
Combine uses ElasticSearch-Dump to export Records as line-delimited JSON documents. This library handles special characters and newlines well, and as such, is recommended. This output format also handles multivalued fields and maintains field types (integer, string).
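For orientation only, a single line of such an export might look roughly like the following; the field names and values are hypothetical, and the exact JSON envelope depends on what ElasticSearch-Dump emits for the mapped fields:
{"record_id": "0142feb40e122a7764e84630c0150f67", "mods_titleInfo_title": "Amazing Record of Incalculable Worth", "mods_subject_topic": ["history", "michigan", "snow"]}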
CSV¶
Alternatively, mapped fields can be exported as comma-separated, tabular data in .csv format. As mentioned, this does not as deftly handle characters that may disrupt column delimiters or newlines.
If a Record contains a mapped field such as mods_subject_topic that is repeating, the default export format is to create multiple columns in the export, appending an integer for each instance of that field, e.g.,
mods_subject_topic.0, mods_subject_topic.1, mods_subject_topic.2
history, michigan, snow
But if the checkbox, Export CSV "Kibana style"?, is checked, all multi-valued fields will export in the “Kibana style”, where a single column is added to the export and the values are comma separated, e.g.,
mods_subject_topic
history,michigan,snow
Background Tasks¶
Combine includes a section for viewing and managing long running tasks. This can be accessed from the “Background Tasks” link at the top-most navigation from any page.
Note: Background tasks do not include normal workflow Jobs such as Harvests, Transforms, Merges, etc., but do include the following tasks in Combine:
- deleting of Organizations, Record Groups, or Jobs
- generating reports of Validations run for a Job
- exporting Jobs as mapped fields or XML documents
- re-indexing a Job with optionally changed mapping parameters
- running new validations, or removing validations, from a Job
The following screenshot shows a table of all Background Tasks, and their current status:
Clicking on the “Results” button for a single task will take you to details about that task:
The results page for each task type will be slightly different, depending on what further actions may be taken, but an example would be a download link (as in the figure above) for job export or validation reports.
Tests¶
Though Combine is by and large a Django application, it has characteristics that do not lend themselves towards using the built-in Django unit tests. Namely, it relies on DB tables that are not managed by Django and, as such, would not be created in the test DB scaffolding that Django tests usually use.
Instead, Combine uses out-of-the-box pytest for unit tests.
Demo data¶
In the directory /tests, some demo data is provided for simulating harvest, transform, merge, and publishing records.
- mods_250.xml - 250 MODS records, as returned from an OAI-PMH response
  - during testing, this file is parsed and 250 discrete XML files are written to a temp location to be used for a test static XML harvest
- mods_transform.xsl - XSL transformation that performs transformations on the records from mods_250.xml
  - during transformation, this XSL file is added as a temporary transformation scenario, then removed post-testing
Running tests¶
Note: Because Combine currently only allows one job to run at a time, and these tests are essentially small jobs that will be run, it is important that no other jobs are running in Combine while running tests.
Tests should be run from the root directory of Combine, which for Ansible or Vagrant builds is likely at /opt/combine. Running tests also requires sourcing the anaconda Combine environment with source activate combine.
It is worth noting whether or not there is already an active Livy session for Combine, or if one should be created and destroyed for testing. Combine, at least at the time of this writing, operates with only a single Livy instance. By default, tests will create and destroy a Livy session, but this can be skipped in favor of using an active session by including the flag --use_active_livy.
Testing creates a test Organization, RecordGroup, and Jobs. By default, these are removed after testing, but they can be kept for viewing or analysis by including the flag --keep_records.
Examples¶
run tests, no output, create Livy session, destroy records
pytest
run tests, see output, use active Livy session, keep records after test
pytest -s --use_active_livy --keep_records
Command Line¶
Though Combine is designed primarily as a GUI interface, the command line provides a powerful and rich interface to the models and methods that make up the Combine data model. This documentation is meant to expose some of those patterns and conventions.
There are two primary command line contexts:
- Django shell: A shell that loads all Django models, with some additional methods for interacting with Jobs, Records, etc.
- Pyspark shell: A pyspark shell that is useful for interacting with Jobs and Records via a spark context.
These are described in more detail below.
Note: For both, the Combine Miniconda python environment must be used, which can be activated from any filepath location by typing:
source activate combine
Django Python Shell¶
Useful and Example Commands¶
Convenience methods for retrieving instances of Organizations, Record Groups, Jobs, Records
'''
Most convenience methods expect a DB identifier for instance retrieval
'''
# retrieve Organization #14
org = get_o(14)
# retrieve Record Group #18
rg = get_rg(18)
# retrieve Job #308
j = get_j(308)
# retrieve Record by id '5ba45e3f01762c474340e4de'
r = get_r('5ba45e3f01762c474340e4de')
# confirm these retrievals
'''
In [2]: org
Out[2]: <Organization: Organization: SuperOrg>
In [5]: rg
Out[5]: <RecordGroup: Record Group: TurboRG>
In [8]: j
Out[8]: <Job: TransformJob @ May. 30, 2018, 4:10:21 PM, Job #308, from Record Group: TurboRG>
In [10]: r
Out[10]: <Record: Record: 5ba45e3f01762c474340e4de, record_id: 0142feb40e122a7764e84630c0150f67, Job: MergeJob @ Sep. 21, 2018, 2:57:59 AM>
'''
Loop through Records in Job and edit Document
This example shows how it would be possible to:
- retrieve a Job
- loop through Records of this Job
- alter Record, and save
This is not a terribly efficient way to do this, but it demonstrates how the Combine data model is accessible via the command line. A more efficient method would be to write a custom, Python snippet Transformation Scenario.
# retrieve Job model instance
In [3]: job = get_j(563)
# loop through records via get_records() method, updating record.document (replacing 'foo' with 'bar') and saving
In [5]: for record in job.get_records():
...: record.document = record.document.replace('foo', 'bar')
...: record.save()
Pyspark Shell¶
The pyspark shell is an instance of Pyspark, with some configurations that allow for loading models from Combine.
Note:
The pyspark shell requires the Hadoop Datanode and Namenode to be active. These are likely running by default, but in the event they are not, they can be started with the following (Note: the trailing : is required, as it indicates a group of processes in Supervisor):
sudo supervisorctl restart hdfs:
Note:
The pyspark shell, when invoked as described below, will be launched in the same Spark cluster that Combine’s Livy instance uses. Depending on available resources, it’s likely that users will need to stop any active Livy sessions as outlined here to allow this pyspark shell the resources to run.
Useful and Example Commands¶
Open Records from a Job as a Pyspark DataFrame
# import some convenience variables, classes, and functions from core.spark.console
from core.spark.console import *
# retrieve Records from MySQL as pyspark DataFrame
'''
In this example, retrieving records from Job #308
Also of note, must pass spark instance as first argument to convenience method,
which is provided by pyspark context
'''
job_df = get_job_as_df(spark, 308)
# confirm retrieval okay
job_df.count()
...
...
Out[5]: 250
# look at DataFrame columns
job_df.columns
Out[6]:
['id',
'combine_id',
'record_id',
'document',
'error',
'unique',
'unique_published',
'job_id',
'published',
'oai_set',
'success',
'valid',
'fingerprint']
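From here, ordinary pyspark DataFrame operations can be used for quick ad hoc checks; a brief sketch using only the columns listed above (assuming valid is stored as a boolean; adjust the comparison if it is an integer flag):
# count how many Records in this Job failed validation
job_df.filter(job_df.valid == False).count()

# peek at identifiers and validity flags for a few Records
job_df.select('record_id', 'valid').show(5)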
Installing Combine¶
There are two deployment methods explained below. Choose the one that meets your needs. Please be aware that running this system requires not insignificant resources: at least 8GB of RAM and 2 processor cores. Combine, at its heart, is a metadata aggregating and processing framework that runs within the Django web framework. It requires other components, such as Elasticsearch and Spark, among others, in order to work properly. If you are looking to test-drive or develop on Combine, you have arrived at the right place.
Pre-Installation Notes:¶
- Both installation methods listed below assume an Ubuntu 18.04 server.
- For either installation, there are a host of variables that set default values. They are all found in the all.yml file inside the group_vars folder.
- If you are installing this system on a remote server, you MUST update the ip_address variable found in all.yml. Change it to your remote server's IP address.
- If you are installing the system locally with Vagrant, you don't need to do anything. Your server will be available at 192.168.45.10.
Vagrant-based Installation (local)¶
- If you are looking to run an instance of the Combine ecosystem on your own computer, you will use the Vagrant-based installation method. This method assumes that you have 8GB of RAM and 2 processor cores available to devote to this system. Double-check and make sure you have this available on your computer. This means you will need MORE than that in RAM and cores in order to not bring your computer to a complete halt. Local testing has been performed on iMacs running MacOS Sierra that have a total of 4 cores and 16 GB of RAM.
- Install VirtualBox, Vagrant, Ansible, Python, and Passlib.
  - NB: when installing Passlib, you should be able to simply run pip install passlib if you have the pip tool installed. If you're not certain, or if that command doesn't successfully run, see the following link for instructions on installing pip: http://www.pythonforbeginners.com/basics/how-to-use-pip-and-pypi.
- Clone the following Github repository: combine-playbook
- Navigate to the repository in your favorite terminal/shell/command line interface.
Within the root directory of the repository, run the commands listed below:
Install pre-requisites:
ansible-galaxy install -f -c -r requirements.yml
Build the system:
vagrant up
This installation will take a while. The command you just ran initializes the vagrant tool to manage the installation process. It will first download and install a copy of Ubuntu Linux (v.18.04) on your VirtualBox VM. Then, it will configure your networking to allow SSH access through an account called vagrant and make the server available only to your local computer at the IP address of 192.168.45.10. After that initial work, the vagrant tool will use ansible to provision (i.e. install all components and dependencies on) the VM on your computer.
After it completes, your server will be available at http://192.168.45.10. Navigating to http://192.168.45.10/admin will allow you to set up your system defaults (OAI endpoints, etc.). Going to http://192.168.45.10/combine will take you to the heart of the application, where you can ingest, transform, and analyze metadata. Login using the following credentials:
username: combine password: combine
Access via SSH is available through the accounts below. Both have sudo privileges. The combine password defaults to what is listed below. If you have edited group_vars/all.yml and changed the password listed there, please adjust accordingly.
```
username: combine
password: combine

username: vagrant
password: vagrant
```
Ansible-based Installation (remote server)¶
If you have a remote server that you want to install the system upon, these installation instructions are for you. Your server should already be running Ubuntu 18.04. It needs to be remotely accessible through SSH from your client machine and have at least port 80 accessible. Also, it needs Python 2.7 installed on it. Your server will need at least 8GB of RAM and 2 cores, but more is better.
Install Ansible, Python, and Passlib on your client machine. This installation method has not been tested using Windows as a client machine, and, therefore, we offer no support for running an installation using Windows as a client. For more information, please refer to these Windows-based instructions: http://docs.ansible.com/ansible/latest/intro_windows.html#using-a-windows-control-machine
- NB: when installing Passlib, you should be able to simply run pip install passlib if you have the pip tool installed. If you're not certain, or if that command doesn't successfully run, see the following link for instructions on installing pip: http://www.pythonforbeginners.com/basics/how-to-use-pip-and-pypi.
Exchange ssh keys with your server.
Example commands on MacOS:
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub | ssh USERNAME@IP_ADDRESS_OR_FQDN "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Point ansible to remote server.
- You do this by creating a file named hosts inside the following directory: /etc/ansible. If you are using a Linux or MacOS machine, you should have an etc directory, but you will probably have to create the ansible folder. Place your server's IP address or FQDN in this hosts file. If the username you used to exchange keys with the server is anything other than root, you will have to add ansible_user=YOUR_USERNAME. Your hosts file could end up looking something like this: 192.168.45.10 ansible_user=USERNAME. For more information see: http://docs.ansible.com/ansible/latest/intro_getting_started.html#your-first-commands
Check your target machine is accessible and ansible is configured by running the following command:
ansible all -m ping
A successful response will look something similar to this. Substitute your IP for the one listed below in the example.
192.168.44.10 | SUCCESS => {
    "changed": false,
    "failed": false,
    "ping": "pong"
}
If the response indicates a failure, it might look something like below. This type of failure indicates that ansible could successfully connect to the server, but that it didn't find Python 2.7 installed on the remote server. This is fine. The important part is that it could connect to the server. The ansible playbook will automatically install Python 2.7 when it begins, so you should be fine to proceed to the next step(s).
192.168.44.10 | FAILED! => {
    "changed": false,
    "failed": true,
    "module_stderr": "Warning: Permanently added '192.168.44.10' (ECDSA) to the list of known hosts.\r\n/bin/sh: 1: /usr/bin/python: not found\n",
    "module_stdout": "",
    "msg": "MODULE FAILURE",
    "rc": 127
}
Clone the following Github repository: combine-playbook
Navigate to the repository in your favorite terminal/shell/command line interface.
Update ip_address in group_vars/all.yml
- Change the ip_address variable to your remote server's IP address.
Within the root directory of the repository, run the commands listed below:
Install pre-requisites
ansible-galaxy install -f -c -r requirements.yml
Run ansible playbook
ansible-playbook playbook.yml
This installation will take a while. Ansible provisions the server with all of the necessary components and dependencies.
After the installation is complete, your server will be ready for you to use Combine's web-based interface. Go to your server's IP address. Navigating to /admin will allow you to set up your system defaults (OAI endpoints, etc.). Going to /combine will take you to the heart of the application, where you can ingest, transform, and analyze metadata. Login using the following credentials:
username: combine
password: combine
Access via SSH is available through the account below. It has sudo privileges. The password below is correct unless you have changed it inside group_vars/all.yml.
username: combine
password: combine
Post-Installation walkthrough¶
Once you do have an instance of the server up and running, you can find a QuickStart walkthrough here.
Tuning and Configuring Server¶
Combine is designed to handle sets of metadata from small to large, from 400 to 4,000,000 Records. Some of the major associated server components include:
- MySQL
  - store Records and their associated, full XML documents
  - store Transformations, Validations, and most other enduring, user defined data
  - store transactions from Validations, OAI requests, etc.
- ElasticSearch
  - used for indexing mapped fields from Records
  - main engine of field-level analysis
- Apache Spark
  - the workhorse for running Jobs, including Harvests, Transformations, Validations, etc.
- Apache Livy
  - used to send and queue Jobs to Spark
- Django
  - the GUI
- Django Background Tasks
  - for long running tasks that would otherwise prevent the GUI from being responsive
  - includes deleting, re-indexing, exporting Jobs, etc.
Given the relative complexity of this stack, and the interconnected nature of the components, Combine is designed to be deployed via an Ansible playbook, which you can read more about here. The default build requires 8GB of RAM, and the more CPU cores the better.
This part of the documentation aims to explain, and indicate how to modify or configure, some of these critical components.