Welcome to Dissemin’s Documentation!

Dissemin is a web platform that gathers academic publications and analyzes their online availability as well as publisher policies. This documentation explains how it works and how to use it.

For Libraries

Dissemin does not just detect papers behind paywalls: it also offers solutions to liberate them. To make this as effortless as possible for scientists, they can upload a publication directly into an open repository. Besides the big repositories Zenodo and HAL, they can deposit publications in institutional repositories. This page and the following ones give you the information you need to add your institutional repository to Dissemin.

As you will learn, this does not involve too much effort. First, we describe the workflow and how a document finds its way into your repository. Then we explain our metadata and services, and give an outline of how we guide you through the integration process. Finally, we dive into more technical topics, giving precise descriptions of the metadata encoding and of how we transfer the file with its metadata.

General Information

How a deposit works

Let us assume that Max Mustermann, a scientist at your institution, wants to deposit one of his publications in your repository. He starts by searching for his publications on Dissemin and finds in the list an article named A modern view on ‘God of the Labyrinth’. He quickly sees that this publication is not yet freely available. Luckily, the publisher allows depositing a preprint version in open repositories. Max decides to make this publication freely available by uploading it into his institution’s repository. He clicks Upload, chooses the right file from his local storage, gives a short subject classification, selects a Creative Commons license and finally hits the upload button. Dissemin creates a package from the metadata and the file Max uploaded and ships it to your repository. Your repository receives the package, extracts the metadata and the document, and creates a new entry.

Max does not see any of this; he is simply happy to read that his publication will soon be available to everyone! He also reads that he has to sign a letter of declaration.

Meanwhile, the repository notifies its staff that a new entry has been created. The staff checks the entry. Everything is fine, except that Max still has to deliver the letter of declaration. The publication is therefore not yet published: the repository has marked the availability of the document as ‘in moderation’.

Max downloads the letter directly from Dissemin. He reads it carefully and then signs it; all the important information has already been filled in.

A few days pass and the letter arrives at the repository staff. The staff checks the letter, then navigates to the corresponding entry in the repository and changes the availability to ‘published’. The repository sends Max an e-mail saying that his article A modern view on ‘God of the Labyrinth’ is now freely available.

Of course, this is just an example with some arbitrary assumptions. Your local workflow may vary, but note that you still decide what you publish and what you do not publish. And it is still you who defines the requirements.

How to add your repository

If you want to connect your repository to Dissemin, please get in contact with us at team@dissem.in.

There are a few steps to accomplish this task.

Given our documentation, it is up to you which data you finally ingest into your repository. Aside from the question of which data you ingest, there is also the question of how you ingest it. For example, your repository probably has a different granularity when it comes to document types. How you map the metadata is completely up to you.

We have created a workflow in our GitLab wiki. It covers the essential steps and is meant to be a guide. From the template we create an issue that we close once the repository has been successfully connected. If needed, we add further steps or remove unnecessary ones.

Metadata

Our main data sources are CrossRef and BASE. Our own data scheme is relatively close to Citeproc / CrossRef. This has the advantage that, in general, we do not need to ask the user for metadata, which makes a deposit require very little effort.

Bibliographic Metadata

For a publication we store the following metadata.

abstract
authors
date (YYYY-MM-DD)
document type
doi
eissn
issn
issue
journal
language
pages
publisher
title
volume

Note, however, that we have renamed Citeproc’s container-title to journal. This has historical reasons: Dissemin initially focused on journal articles.
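
For illustration, such a record could look like the following Python dictionary (all values are invented):

# Illustrative only: an invented record with the fields listed above.
record = {
    "title": "A modern view on 'God of the Labyrinth'",
    "authors": [{"first": "Max", "last": "Mustermann"}],
    "date": "2019-04-01",                       # YYYY-MM-DD
    "document type": "journal-article",
    "journal": "Journal of Labyrinth Studies",  # Citeproc's container-title
    "publisher": "Example Press",
    "volume": "7",
    "issue": "2",
    "pages": "100-123",
    "doi": "10.1234/example",
    "issn": "1234-5678",
    "eissn": "2345-6789",
    "language": "en",
    "abstract": "An abstract of the paper.",
}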

Document types

We have the following document types:

book
book-chapter
dataset
journal-article
journal-issue
poster
preprint
proceedings
proceedings-article
reference-entry
report
thesis
other

Almost 90 per cent of our deposits are journal articles.

Additional Metadata

In addition to the above metadata, we can ship the following out of the box:

depositor information (name, email, orcid)
dewey decimal class
embargo
language (with langdetect)
license
sherpa romeo id

Services

Services are meant to support your local workflow. If you are interested in a new service, please let us know.

Letter of Declaration

Usually institutional repositories require some kind of letter of declaration from their scientists. With this letter the scientists make certain legal declarations about the publication and its deposition.

Example of letter of declaration service alert after upload
Serving PDF

Dissemin can generate this type of letter individually per repository. This way the letter fits your needs in terms of design, content and legal character. We can prefill the letter with all the necessary data, so that the depositor just has to sign it and send it to you.

Our standard approach is to fill in the form fields of the document. This preserves its legal character, since the document itself is not changed.

After the deposit, depositors are informed that they need to fill in such a letter and send it to your repository administration. They can download the letter directly, and they can regenerate it at any point in time as long as you have not published the resource.

Linking to online form

If you provide an online form, Dissemin can link to this form.

Green Open Access Service

This is a service where the repository administration supports the researchers, e.g. by publishing on behalf of the researchers, which may include checking rights, getting in contact with publishers and so on.

Dissemin can advertise this service after a successful deposit in your repository. The user gets a notification with a short text and a link that describes your service.

Example of Green Open Access service alert after upload

Technical Scope

In this document we describe the technical part of adding a new repository. For now we describe the SWORDv2 protocol, but we are not bound to this form of transmission.

SWORDv2

To some extent the SWORDv2 protocol is easy to use, as it involves just some basic HTTP operations. We decided to use its packaging capability, because this way we can easily ship metadata and document at the same time.

METS

We use the Metadata Encoding & Transmission Standard (METS) to ship our metadata and to describe what is delivered. Most repositories are able to ingest a METS package.

We try to keep our METS as simple as possible. Currently we populate only dmdSec, amdSec/rightsMD, fileSec and structMap.

  • dmdSec contains the bibliographic metadata
  • amdSec/rightsMD contains information about the depositor, the license and additional Dissemin related information. You find the documentation in Dissemin Metadata.

We deliver two files per package:

  • mets.xml - containing the metadata
  • a_pdf_file.pdf - the document to be deposited. The name may vary as we ship the original filename from the user. You find the filename in structMap inside of mets.xml.

We set the following headers:

Content-Type: application/zip
Content-Disposition: filename=mets.zip
Packaging: http://purl.org/net/sword/package/METSMODS

If your ingest requires a different packaging name or packaging format, please let us know.
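
As a rough sketch, the deposit request we send looks like this (using Python's requests; the endpoint URL and credentials below are invented placeholders):

import requests

# Ship the zip package with the headers described above.
with open("mets.zip", "rb") as package:
    response = requests.post(
        "https://repository.example.org/sword2/collection",  # placeholder endpoint
        data=package,
        headers={
            "Content-Type": "application/zip",
            "Content-Disposition": "filename=mets.zip",
            "Packaging": "http://purl.org/net/sword/package/METSMODS",
        },
        auth=("dissemin", "secret"),  # placeholder credentials
    )
response.raise_for_status()  # the response body carries the deposit receipt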

MODS

Currently we create version 3.7. We stick as closely to the MODS definitions as possible. You find our mappings below, in XPath notation.

abstract => abstract
author => name[@type="personal"]/namePart (given + family + name)
author[orcid] => name/nameIdentifier[@type="orcid"]
date => originInfo/dateIssued[@encoding="w3cdtf"] (YYYY-MM-DD)
ddc => classification[@authority="ddc"]
document type => genre
doi => identifier[@type="doi"]
eissn => relatedItem/identifier[@type="eissn"]
issn => relatedItem/identifier[@type="issn"]
issue => relatedItem/part/detail[@type="issue"]/number
journal => relatedItem/titleInfo/title
language => language/languageTerm[@type="code"][@authority="rfc3066"]
pages => relatedItem/part/extent[@unit="pages"]/total or start + end
publisher => originInfo/publisher
title => titleInfo/title
volume => relatedItem/part/detail[@type="volume"]

Note that volume, issue and pages are often not plain Arabic numerals, but may contain other literals. Although MODS does provide fields for designations like No., Vol. or p., we do not use them, because our data sources don’t.

We ship the language as rfc3066, determined by langdetect, and only if both of the following conditions are satisfied:

  1. The abstract is at least 256 characters long (including whitespace)
  2. langdetect reports a confidence of at least 50%

If we cannot determine any language, we omit the field.

We ship the ddc number as a three-digit number, i.e. numbers with fewer than three digits are padded with leading zeros.
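
For illustration, a minimal sketch of these two rules in Python (using langdetect; the function names are ours):

from langdetect import detect_langs

def abstract_language(abstract):
    # Ship the language only if the abstract is long enough and langdetect
    # is confident enough; otherwise the field is omitted.
    if len(abstract) < 256:
        return None
    best = detect_langs(abstract)[0]  # candidates sorted by probability
    return best.lang if best.prob >= 0.5 else None  # e.g. "en"

def format_ddc(ddc):
    # Three digits, padded with leading zeros: 4 -> "004", 51 -> "051".
    return str(ddc).zfill(3)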

Dissemin Metadata

MODS does not completely fit our needs in terms of metadata. Thus we add our own metadata that extends the MODS metadata.

We ship the following data:

  • authentication method: The method of authentication; currently shibboleth or orcid.
  • name: The first and last name of the depositor. Note that the depositor does not need to be a contributor of the publication.
  • email: E-mail address of the depositor, in case you need to contact them.
  • orcid: The ORCID of the depositor, if they have one.
  • is contributor: true/false; states whether the depositor is one of the contributors or deposits on behalf of someone else.
  • identical institution: true/false; states whether the depositor’s institution is identical to the institution of the repository.
  • license: The license. Most likely Creative Commons, but other licenses are possible. We happily add new licenses for your repository. We deliver the name and, where they exist, a URI and a transmit id.
  • embargoDate: The publication must not be published before this date. It may be published on this date.
  • SHERPA/RoMEO ID: The ID of the journal in SHERPA/RoMEO. Using their API or web interface you can quickly obtain the publisher’s conditions.
  • DisseminID: This ID refers to the publication in Dissemin. It is not persistent, due to the internal data model: merging and unmerging papers might create or delete primary keys in the database. For a ‘short’ period of time this ID will definitely be valid. You can use the DOI shipped in the bibliographic metadata to get back to the publication in Dissemin.

If you need more information for your workflow, please contact us. We can add additional fields.

You can find our schema for download in Version 1.0.

Deposit Receipt

Dissemin expects your repository to return a SWORDv2 deposit receipt, although it is optional in the SWORDv2 standard. Please make sure that it contains a splash URL, i.e. the landing page of the deposited document. Your deposit receipt should look like:

<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
    ...
    <link rel="alternate" href="https://repository.dissem.in/item/12345"/>
    ...
</entry>

where href contains the splash URL of the deposited item.

Currently, Dissemin extracts the identifier of the deposit from the splash URL.
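
As a sketch, extracting the splash URL from such a receipt can be done with the Python standard library:

import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def extract_splash_url(receipt_xml):
    # The rel="alternate" link of the Atom entry carries the splash URL.
    entry = ET.fromstring(receipt_xml)
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "alternate":
            return link.get("href")
    return None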

Examples and Scripts

To support you in your local implementation we have some examples and scripts.

Examples

The examples are authentic, i.e. they were created with Dissemin and show what the metadata documents will look like. For each document type there are one or more examples. They cover different cases like dewey decimal class or embargo.

Upload scripts

You can download our script for testing your implementation. The HTTP request is identical to the one Dissemin sends. You find usage instructions in the README.md inside the package.

Update Deposit Status

Unless a document is directly published in a repository, the internal publication status inside Dissemin will be pending.

Dissemin knows the following statuses:

('failed', _('Failed')), # we failed to deposit the paper
('pending', _('Pending publication')), # the deposit has been submitted but is not publicly visible yet
('embargoed', _('Embargo')), # the publication will be published, but only after a certain date
('published', _('Published')), # the deposit is visible on the repo
('refused', _('Refused by the repository')),
('deleted', _('Deleted')), # deleted by the repository

In order to keep the status up to date and inform users when their publication is freely available, we ask the repository about the status on a daily basis as long as the status is pending. This requires some extra work, as we cannot use OAI-PMH: it would not inform us about declined deposits or embargoes.

Provide an endpoint running a little script that does the job. From the SWORDv2 response we extract the id of the entry in your repository and pass that id as a GET parameter, like so

https://repository.example.org/scripts/status?id=3243

or

https://repository.example.org/entry/3243?convert=toDissemin

As response we expect simple JSON containing status, publication_date and pdf_url, where status is one of pending, embargoed, published, refused. In the cases embargoed and published we would like to have the publication date, i.e. when the resource becomes publicly available, as YYYY-MM-DD, and, if possible, the direct link to the PDF. Below is a simple example.

{
    "status" : "published",
    "publication_date" : "2020-03-12",
    "pdf_url" : "https://repository.example.org/documents/3234/document.pdf"
}
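
For illustration, a minimal implementation of such an endpoint could look like the following Django-style view (the get_entry lookup is a hypothetical, repository-specific helper; your repository software may not be Django at all):

from django.http import JsonResponse

def deposit_status(request):
    # Hypothetical sketch: look up the entry by the id GET parameter
    # and answer with the fields Dissemin expects.
    entry = get_entry(request.GET["id"])  # repository-specific lookup
    payload = {"status": entry.status}    # pending / embargoed / published / refused
    if entry.status in ("embargoed", "published"):
        payload["publication_date"] = entry.publication_date.isoformat()
        if entry.pdf_url:
            payload["pdf_url"] = entry.pdf_url
    return JsonResponse(payload)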

We do not have plans to support any batch processing at the moment.

Repository Helpers

We cannot directly provide support for the necessary implementations or configurations on the repository that is going to be connected.

But we are happy to support any repository administrator with at least some documentation.

EPrints 3

EPrints 3 has been successfully connected to Dissemin.

Zaharina Stoynova from ULB Darmstadt has worked on a plugin to ingest Dissemin’s metadata. As you will probably not be able to use it directly, please make the changes your setup requires.

broker_eprints_3.zip

The package consists of two files:

  1. METSMODS_Broker.pm
  2. METSMODS_Broker_mods_parser.pm

The first file deals with some general things like data integrity and ingests the Dissemin metadata, while the other file deals with the MODS metadata itself.

API

Dissemin provides an API to query the availability of arbitrary papers.

Querying the API

Querying by DOI

You can retrieve Dissemin’s metadata for a specific paper by DOI:

https://dissem.in/api/10.1016/j.paid.2009.02.013
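
For example, from Python:

import requests

# Fetch the metadata for a paper by DOI and print its title.
metadata = requests.get("https://dissem.in/api/10.1016/j.paid.2009.02.013").json()
print(metadata["paper"]["title"])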

Querying by Dissemin Paper ID

Dissemin stores internal numeric identifiers for its papers. These identifiers are exposed in the URLs of the paper pages, for instance. It is possible to retrieve metadata from these identifiers:

https://dissem.in/api/p/6859902

Querying by Metadata Fields

When the DOI or the Dissemin ID are not known, it is possible to retrieve a paper by title, authors and publication date. This is done by posting a JSON object encoding this metadata to https://dissem.in/api/query/ , as follows:

curl -H "Content-Type: application/json" -d '{"title":"Refining the Conceptualization of an Important Future-Oriented Self-Regulatory Behavior: Proactive Coping", "date":"2009-07-01","authors":[{"first":"Stephanie Jean","last":"Sohl"},{"first":"Anne","last":"Moyer"}]}' https://dissem.in/api/query/

The date field can contain coarser dates such as 2009-07 or 2009, and authors can also be specified as plain text with {"plain":"Anne Moyer"} instead of {"first":"Anne","last":"Moyer"}.

This API method uses the internal paper deduplication strategy in Dissemin to match the bibliographic reference to a known paper in the database. This deduplication is done by computing a unique key (called fingerprint) from the title, authors and publication date. Therefore, this API method will always return at most one paper, unlike the search endpoint below which works like traditional search engines.

Searching the API

The search interface is also exposed as an API. The parameters it understands are the same as the human-readable version at https://dissem.in/search. Statistics about the results are also returned.

There are the following search keys:

authors
List of authors, separated by ,. To enforce a last name, prefix with last:.
doctypes
Filter by document types. There are the following document types available: book, book-chapter, dataset, journal-article, journal-issue, other, poster, preprint, proceedings, proceedings-article, reference-entry, report, thesis
pub_after
Published after given date. The format is YYYY, YYYY-MM, YYYY-MM-DD.
pub_before
Published before given date. The format is YYYY, YYYY-MM, YYYY-MM-DD.
q
Search for title
sort_by
The results are sorted descending by date, i.e. newest first. To reverse the order, pass pubdate.
status

The open access status as computed by Dissemin. This can be one of

oa
Available from the publisher
ok
Available from the author
couldbe
Could be shared by the authors
unk
Unknown/unclear sharing policy
closed
Publisher forbids sharing

You can pass multiple status values.
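
For example, assuming the API counterpart of the search page lives at https://dissem.in/api/search/ (an assumption; adjust the URL to the actual endpoint), a query from Python could look like:

import requests

params = {
    "q": "proactive coping",
    "doctypes": "journal-article",
    "pub_after": "2015",
    "status": ["oa", "ok"],  # repeated status parameters are allowed
}
results = requests.get("https://dissem.in/api/search/", params=params).json()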

Understanding the Results

{

    "status": "ok",
    "paper": {
        "classification": "UNK",
        "title": "Refining the Conceptualization of an Important
Future-Oriented Self-Regulatory Behavior: Proactive Coping",
        "pdf_url": "http://www.ncbi.nlm.nih.gov/pubmed/19578529",
        "records": [
            {
                "splash_url": "https://doi.org/10.1016/j.paid.2009.02.013",
                "doi": "10.1016/j.paid.2009.02.013",
                "publisher": "Elsevier BV",
                "issue": "2",
                "journal": "Personality and Individual Differences",
                "issn": "0191-8869",
                "volume": "47",
                "source": "crossref",
                "policy": {
                    "romeo_id": "30",
                    "preprint": "can",
                    "postprint": "can",
                    "published": "cannot"
                },
                "identifier": "oai:crossref.org:10.1016/j.paid.2009.02.013",
                "type": "journal-article",
                "pages": "139-144"
            },
            {
                "splash_url": "https://www.researchgate.net/publication/26648440_Refining_the_Conceptualization_of_an_Important_Future-Oriented_Self-Regulatory_Behavior_Proactive_Coping",
                "doi": "10.1016/j.paid.2009.02.013",
                "contributors": "",
                "abstract": "Proactive coping, directed at an upcoming as
opposed to an ongoing stressor, is a new focus in positive psychology
research. However, two differing conceptualizations of this construct
create confusion. This study compared how each operationalization of
proactive coping relates to well-being. Participants (N = 281) facing an
upcoming college examination completed the Proactive Coping Inventory
(PCI; consisting of two subscales that each assess one of the
conceptualizations), the Proactive Competence Scale (PCS; that assesses
the proactive coping process), and measures of well-being. The results
demonstrated that conceptualizing proactive coping as a
positively-focused striving for goals was predictive of well-being (the
shared variance from affect, subjective well-being and physical
symptoms), whereas conceptualizing proactive coping as focused on
preventing a negative future was not. The first conceptualization of
proactive coping's unique association with well-being was explained by
two of the proactive competencies, use of resources and realistic goal
setting, and the remaining variance in well-being was explained by the
first factor of optimism. These results demonstrated that aspiring for a
positive future is distinctly predictive of well-being and that research
should focus on accumulating resources and goal setting in designing
interventions to promote proactive coping.",
                "pdf_url": "https://www.researchgate.net/profile/Stephanie_Sohl2/publication/26648440_Refining_the_Conceptualization_of_an_Important_Future-Oriented_Self-Regulatory_Behavior_Proactive_Coping/links/55e463c008ae2fac47227a76.pdf",
                "source": "researchgate",
                "keywords": "",
                "identifier": "oai:researchgate.net:26648440",
                "type": "journal-article"
            },
            {
                "splash_url": "http://www.ncbi.nlm.nih.gov/pubmed/19578529",
                "doi": "10.1016/j.paid.2009.02.013",
                "contributors": "",
                "abstract": "Proactive coping, directed at an upcoming as
opposed to an ongoing stressor, is a new focus in positive psychology
research. However, two differing conceptualizations of this construct
create confusion. This study compared how each operationalization of
proactive coping relates to well-being. Participants (N = 281) facing an
upcoming college examination completed the Proactive Coping Inventory
(PCI; consisting of two subscales that each assess one of the
conceptualizations), the Proactive Competence Scale (PCS; that assesses
the proactive coping process), and measures of well-being. The results
demonstrated that conceptualizing proactive coping as a
positively-focused striving for goals was predictive of well-being (the
shared variance from affect, subjective well-being and physical
symptoms), whereas conceptualizing proactive coping as focused on
preventing a negative future was not. The first conceptualization of
proactive coping’s unique association with well-being was explained by
two of the proactive competencies, use of resources and realistic goal
setting, and the remaining variance in well-being was explained by the
first factor of optimism. These results demonstrated that aspiring for a
positive future is distinctly predictive of well-being and that research
should focus on accumulating resources and goal setting in designing
interventions to promote proactive coping.",
                "pdf_url": "http://www.ncbi.nlm.nih.gov/pubmed/19578529",
                "source": "base",
                "keywords": "Article",
                "identifier": "ftpubmed:oai:pubmedcentral.nih.gov:2705166",
                "type": "other"
            }
        ],
        "authors": [
            {
                "name": {
                    "last": "Sohl",
                    "first": "Stephanie Jean"
                }
            },
            {
                "name": {
                    "last": "Moyer",
                    "first": "Anne"
                }
            }
        ],
        "date": "2009-07-01",
        "type": "journal-article"
    }

}

Most fields are self-explanatory; here is a quick description of the others:

  • classification is the code for the self-archiving policy of the publisher: “OA” (available from the publisher), “OK” (some version can be shared), “UNK” (unknown/unclear sharing policy), “NOK” (restrictive sharing policy).
  • splash_url is a URL where Dissemin thinks that the paper is described, without being necessarily available. This can be a publisher webpage (with the article available behind a paywall), a page about the paper without a copy of the full text (e.g., a HAL page like https://hal.archives-ouvertes.fr/hal-01664049), or a page from which the paper was discovered (e.g., the profile of a user on ORCID).
  • pdf_url is a URL where Dissemin thinks the full text can be accessed for free. This is rarely a direct link to an actual PDF file; it is often a link to a landing page (e.g., https://arxiv.org/abs/1708.00363). It is set to null if we could not find a free source for this paper.
  • records gives a list of the places where the full text has been made available (so: repositories, homepages or social networks). Sometimes, these repositories only contain a bibliographical record and not the full text. The pdf_url field of each record indicates our assessment of the availability of that record. If the publisher has been found in RoMEO, it also indicates the summary of its policy, using the codes drawn from the RoMEO API. This list will remain empty if no DOI is provided.

License, Usage

CAPSH claims no ownership of the metadata served via this API. It has been collected from various free sources.

The interface itself should not be abused: please do not use concurrent connections on it, and keep your requests to a slow rate (at most one per second). If you need a faster access to this data, please get in touch with us.

Data sources

Dissemin works with various data sources, providing bibliographic references, full texts and publisher’s policies.

CrossRef

CrossRef is an association of publishers, mainly in charge of issuing Digital Object Identifiers (DOIs) for academic publications. We harvest the CrossRef API on a daily basis.

CrossRef is our main source of publications. On harvesting, we try to match the journal with information from SHERPA/RoMEO.

BASE

BASE is a search engine for academic publications. BASE harvests several thousand free and open repositories. We use BASE to extend our information on full text availability. It covers a huge number of green open access publications.

SHERPA/RoMEO

SHERPA/RoMEO is a service run by JISC which provides a semi-structured representation of publishers’ self-archiving policies. They offer an API whose functionality is very similar to the search service they offer to their regular users. You can search for a policy by journal or by publisher. Since some publishers have multiple archiving policies, RoMEO recommends searching by journal, because this ensures that you get the policy in place for that specific journal.

We synchronize our data with SHERPA’s data every two weeks using their dumps.

For many journal articles and all conference papers, RoMEO knows the publisher but not the journal, while the metadata returned by CrossRef contains both the journal (or the proceedings title) and the publisher. We therefore use a two-step approach:

  • We search for the journal: if it succeeds, we assign the journal and the corresponding policy to the paper.
  • If it fails, we search for a default policy from the publisher. Default policies are those which have a null romeo_parent_id.

Because the publisher names are not always the same in CrossRef and SHERPA/RoMEO, we add heuristics to disambiguate publishers. We take the papers for which a corresponding journal was found in SHERPA and collect their publisher names as seen in CrossRef. If a given CrossRef publisher name is overwhelmingly associated with a given SHERPA publisher that has a default publisher policy (romeo_parent_id is null), then we also link CrossRef papers with this publisher name but no matching journal to this SHERPA publisher.

ORCID

ORCID has a public API that can be used to fetch the metadata of all papers (“works”) made visible on any ORCID profile (unfortunately, very often the profiles are empty). ORCID does not enforce any strict metadata format, which makes it hard to import papers into Dissemin. Specifically, works do not always have a list of authors (which is a shame, given that this service is supposed to solve the ambiguity of author names). Even worse, when an author list is provided, the owner of the ORCID record is almost never identified in this list.

We try to make the most of the available metadata:

  • If a DOI is present, we fetch the metadata using content negotiation;
  • If a BibTeX version of the metadata is available, we parse the BibTeX record to extract the title and author names;
  • Otherwise, if no authors are given, we skip the paper.

Data model

This section explains how metadata is represented in Dissemin. There are two important models: OaiRecord and Paper (both defined in papers/models.py).

(Diagram: the relationship between the OaiRecord and Paper models)

The OaiRecord model represents an occurrence of a paper in some external repository (from the publisher or from an open repository). Each OaiRecord has at least a splash_url (the URL of the landing page of the paper in the repository) and sometimes a pdf_url. The pdf_url is present if and only if we think that the full text is available from this repository. This pdf_url should ideally be a direct link to the full text, but often it is actually equal to the splash_url (but its presence still indicates that the full text is available somehow).

These records are grouped into Paper objects (via a foreign key from OaiRecord to Paper). This deduplication process is done by two criteria:

  • first, OaiRecords with the same DOI are merged into the same paper.
  • second, we compute a fingerprint of the OaiRecord metadata, which consists of the title, author last names and publication year. Any two OaiRecords with identical fingerprints are also merged into the same Paper.
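
A simplified sketch of such a fingerprint computation (the real implementation normalizes more aggressively):

import hashlib
import re

def fingerprint(title, author_last_names, pub_year):
    # Normalize each part: lowercase and collapse non-alphanumeric runs.
    def clean(s):
        return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    parts = [clean(title)] + sorted(clean(n) for n in author_last_names) + [str(pub_year)]
    return hashlib.sha1("/".join(parts).encode("utf-8")).hexdigest()

# Two records with the same title, author last names and year collide:
# fingerprint("Proactive Coping", ["Sohl", "Moyer"], 2009)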

Dissemin harvests four metadata sources: ORCID, CrossRef, BASE and Unpaywall (oadoi). Each of these implements the PaperSource interface, which provides mechanisms to push papers to the database. The responsibility of each PaperSource is to provide BarePaper instances, which are Python objects representing papers that have not been saved to the database yet (and therefore not deduplicated). When doing this, each PaperSource determines from the metadata it has access to whether pdf_url should be filled or not (depending on whether we think the metadata indicates full text availability).
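
Schematically, a source might look like the following sketch (the import path and the fetch_papers entry point are assumptions about the backend code; check the backend module for the actual interface):

from backend.papersource import PaperSource  # assumed import path

class MySource(PaperSource):
    # Sketch only: fetch_papers is the assumed entry point; it yields
    # BarePaper instances, which Dissemin then deduplicates and saves.
    def fetch_papers(self, researcher):
        for record in self.query_external_api(researcher):  # hypothetical helper
            yield self.convert_record(record)  # hypothetical: builds a BarePaper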

Contributing

This section explains how to do some development tasks on the source code or helping with translations.

Setting up Dissemin for Development in an IDE

This page lists some possible ways to set up Dissemin locally for development, including setting up an IDE to edit Dissemin conveniently. First, you need to install Dissemin locally: see Installation for that. In particular, you will need to have postgres, redis and elasticsearch instances running during development, as these services are required to run the tests.

Eclipse and PyDev

Although it is primarily designed to work on Java programs, Eclipse can be used to work on Python projects thanks to its PyDev extension. This includes a Python editor, debugger and importantly, the ability to run tests selectively from the editor. Moreover PyDev comes with a Django integration too.

To install Eclipse, simply download it and unzip it in your favourite location. Fire up Eclipse and click Help, Eclipse Marketplace. In the field to search for new software, type PyDev and install it.

You will then need to point Eclipse to your copy of Dissemin, so that it can be opened as a project. To do so, click File, Open Projects from File System, and select the directory where you have cloned Dissemin. Click Finish: this will create your project. The project might not be recognized as a Python project, so enable PyDev on it by right-clicking on the project (in the Project Explorer), then PyDev, Set as PyDev Project. Do the same to enable it as a Django project too.

Assuming you have installed Dissemin’s Python dependencies in a Virtualenv, you will need to configure that too (otherwise Eclipse will try to run Dissemin with the system’s own Python installation). To do so, right click on the project, select Properties and go to the PyDev - Interpreter/Grammar pane. Then click Click here to configure an interpreter not listed and choose Open interpreter preference pages. Then Browse for python/pypy exe and select the Python executable in your virtualenv (it normally lives at my_virtualenv/bin/python). Give it a name such as Dissemin virtualenv. When prompted to add entries to PYTHONPATH, select them all and validate. Finally, click Apply and close. Once you are back on the project’s own interpreter configuration page, do not forget to select the newly-created interpreter configuration.

Finally, you will need to configure PyDev so that it uses pytest to run the tests. This has the benefit of handling Django’s initialization for you. This setting is not stored at project level; you need to go to PyDev’s general preferences to change it. Click Window, Preferences, select the PyUnit pane in the PyDev group and choose the Py.test runner. Finally, click Apply and Close.

You can now easily run individual tests from the editor. Go to a test file, use Ctrl-Shift and the up and down arrows to navigate to the test method or test class that you want to run. Then press Ctrl-F9 and validate with Enter. This will run your tests and display their results in the dedicated pane below. You can also set up breakpoints and run the tests in the debugger.

PyCharm

PyCharm’s Django integration is not available in the Community edition. However, because we use Pytest to run tests, it might be possible to use the Community edition anyway. If you try, please let us know how it goes so that we can update this documentation.

Using a standard text editor

That works too, of course. In that case you might want to make sure you run ./pyflakes.sh to check for import errors and other syntactical issues before you commit or push. This can easily be done by adding a git hook:

ln -s pyflakes.sh .git/hooks/pre-commit

Localization

We use Django’s standard localization system, based on i18n. This lets us translate strings in various places.

Localization in Files

Most localizations are in files:

  • in Python code, use _("some translatable text"), where _ is imported by from django.utils.translation import ugettext_lazy as _

  • in Javascript code, use gettext("some translatable text")

  • in HTML templates, use either {% trans "some translatable text" %} for inline translations, or, for longer blocks of text:

    {% blocktrans trimmed %}
    My longer translatable text.
    {% endblocktrans %}
    

    The trimmed argument is important as it ensures leading and trailing whitespace and newlines are not included in the translation strings (they mess up the translation system).

Localized Models

Currently the following models use translations:

  • DDC (field name) in deposit.models
  • GreenOpenAccessService (fields heading, text) in deposit.models
  • LetterOfDeclaration (fields heading, text, url_text) in deposit.models
  • License (field name) in deposit.models
  • Repository (fields name, description) in deposit.models

For localization we use django-vinaigrette. Please read its documentation for further information. In short, you have to keep in mind that:

  • in the admin interface you do not see the localized strings,
  • you should add only English in the admin interface,
  • you will have to recreate the *.po files and add the translation manually (see below),
  • your local translations do not interact with TranslateWiki.

From our production environment we have extracted the strings from the above models. They are stored in model-gettext.py so that they are available for TranslateWiki.

Generating PO files

We have a two-way system of generating the PO files. Most of the translation strings can be generated locally, except the strings from the production database (the localized models). The important thing when generating PO files locally is to preserve the strings from production and not overwrite them with strings from your local database.

Vinaigrette saves the strings to be translated in a file called vinaigrette-deleteme.py. Usually this file would be deleted, but we keep it, as it carries our translation strings from the models.

Since we use TranslateWiki, please do not generate any .po files, as there is a high chance of a merge conflict. Just state that your PR uses localizations, and the Dissemin team will generate the .po files.

Unless you need localization in your development environment, you can ignore the following sections.

Generate locally

To generate the PO files, run:

python manage.py makemessages --keep-pot --no-wrap --no-vinaigrette --ignore doc --ignore .venv

This will generate a PO template locale/django.pot that can be used to update the translated files for each language, such as locale/fr/LC_MESSAGES/django.po. It does not change vinaigrette-deleteme.py. Note that in some circumstances Django can generate new translation files for languages not yet covered. In this case these new files should be deleted, as they would break TranslateWiki. It is also necessary to generate separate PO files for JavaScript translations:

python manage.py makemessages -d djangojs --keep-pot --no-wrap --no-vinaigrette --ignore doc --ignore .venv

You can then compile all the PO files into MO files so that they can be displayed on the website:

python manage.py compilemessages --exclude qqq

That’s it! The translations should show up in your browser if it indicates the target language as your preferred one. Locale qqq contains instructions for translators and therefore does not require compiling (which is worth avoiding since it can contain errors).

Generate from Models

On the production environment run:

python manage.py makemessages --keep-pot --no-wrap --keep-vinaigrette-temp --ignore doc --ignore .venv

This generates locales as above, but additionally generates vinaigrette-deleteme.py. Add this file as model-gettext.py to version control and proceed locally as in the above section.

You can safely revert all PO files on production with:

git checkout -- locale
git clean -f locale

Available Languages

You can change the set of available languages for your installation in dissemin/settings/common.py by changing the LANGUAGES list, e.g. by commenting or uncommenting the corresponding lines.

Translations

Translations are hosted at TranslateWiki, for an easy-to-use interface for translations and statistics.

Tests

Dissemin’s test suite is run using pytest rather than Django’s ./manage.py test. Pytest offers many additional features compared to Python’s standard unittest, which Django uses. To run the test suite, you need to install pytest and the other packages mentioned in requirements-dev.txt.

The test suite is configured in pytest.ini, which determines which files are scanned for tests, and where Django’s settings are located.

Some tests rely on remote services. Some of them require API keys; they will fetch them from the following environment variables (or be skipped if these environment variables are not defined):

  • ROMEO_API_KEY
  • ZENODO_SANDBOX_API_KEY required for tests of the Zenodo interface. This can be obtained by creating an account on sandbox.zenodo.org and creating a “Personal Access Token” from there.

Fixtures

Dissemin comes with some fixtures predefined. There are mainly two types:

  1. Fixtures coming from load_test_data
  2. Pure python fixtures in conftest.py

While the first class of fixtures loads a lot of data into the test database, these fixtures are not always suitable and a little obscure. We encourage you not to use them unless necessary.

The second class is not yet complete. You find some fixtures in the project’s root. You can add more fixtures as you need them. If a fixture is only suitable or interesting for a single app, please put it in that app’s conftest.py.

So, for example, if you need more repositories, or repositories with special properties, add the corresponding function to the Dummy class of the repository fixture. If you want to use this new repository often out of the box, add a new fixture that gets it from the Dummy class, as shown with the dummy_repository fixture.

The benefit of the second approach is more control and better extensibility.

We also provide some fixtures in JSON in our folder test_data.

Mocking

Currently we use mocking only partially. If you write any new test, please use proper mocking.

DOIResolver

A common scenario is fetching a paper by DOI into the database. To do this, you can use Paper.create_by_doi from the Paper class. Then you just use the fixture mock_doi and add a JSON file for the DOI. You need to use a slugified file name; run the test and check the output to see the expected filename.
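
A test using this fixture could look like the following sketch (assuming pytest-django and the fixtures described above):

import pytest

from papers.models import Paper

@pytest.mark.django_db
def test_create_by_doi(mock_doi):
    # mock_doi serves the stored JSON file (named after the slugified DOI)
    # instead of querying the live DOI resolver.
    paper = Paper.create_by_doi("10.1016/j.paid.2009.02.013")
    assert paper.title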

HTML-Validation

Dissemin has tests for HTML validation. Usually this validation is done for the English language only, but we check all available languages on Tuesdays with Travis CI. For this we use the Nu Markup Checker (VNU).

Your Vagrant box should be fine, since we deploy a docker image running at http://localhost:8888. If you do not use Vagrant, make sure the VNU server is running while your tests are running. You can e.g. do

sudo docker run -it --restart always -p 8888:8888 validator/validator:latest

This way the validator will be up and running after restarting your machine.

To check a page, use the fixture check_html.

Documentation

This documentation is generated by Sphinx and hosted on Readthedocs, so it will always be up to date.

Building local documentation

In case you need local documentation, e.g. to check formatting, activate your virtual environment and make sure you have installed requirements-dev.txt.

Then, compile these RST sources to HTML:

cd doc/sphinx ; make html

The HTML output is then available in doc/sphinx/_build/html/.

The theme is the same as from Readthedocs.

Generating model diagrams

The UML diagrams for the models are generated using the django-extensions library. When making changes to the models, these diagrams should be updated. They are generated as follows:

./manage.py graph_models papers | dot -Tpng -o papers_models.png

It is possible to generate a single graph for multiple apps, using:

./manage.py graph_models papers publishers deposit | dot -Tpng -o many_models.png

This relies on the graphviz package to render the graphs to PNG (apt-get install graphviz). More documentation about this feature can be found at https://django-extensions.readthedocs.io/en/latest/graph_models.html.

Writing an Interface for a Repository

Writing an interface for a new repository is very easy! Here is a quick tutorial, whose only requirements are some familiarity with Python and a running instance of Dissemin.

First, check that the repository of interest has an API that allows deposits. This can be a REST API, the SWORD protocol or something else, depending on the repository.

Implementing the protocol

Protocol implementations are stored as submodules of the deposit module. So, for a new protocol, you create a new folder and create an __init__.py to activate it. In case you rely on SWORDv2 with METS, you can add a subclass to its protocol.py.

To tell Dissemin how to interact with the repository, you need to write an implementation of that protocol. This has to be done in a dedicated file, deposit/newprotocol/protocol.py, by creating a subclass of RepositoryProtocol:

from django.utils.translation import ugettext as _

from deposit.protocol import DepositResult, RepositoryProtocol

class NewProtocol(RepositoryProtocol):

    def submit_deposit(self, pdf, form):
        result = DepositResult()

        #########################################
        # Your paper-depositing code goes here! #
        #########################################


        # If the deposit succeeds, our deposit function should return
        # a DepositResult, with the following fields filled correctly:

        # A unique id provided by the repository (useful to modify
        # the deposit afterwards). The form is free, it only has to be a string.
        result.identifier = 'myrepo/deposit/12345'

        # The URL of the page of the paper on the repository,
        # and the URL of the full text
        result.splash_url = 'http://arxiv.org/abs/0809.3425'
        result.pdf_url = 'http://arxiv.org/pdf/0809.3425'

        return result

Let us see how we can access the data provided by Dissemin to perform the upload. The paper to be deposited is available in self.paper, as a Paper instance. This gives you access to everything we know about the paper: title, authors, sources, bibliographic information, identifiers, publisher’s policy, and so on. You can either access it directly from the attributes of the paper, for instance with self.paper.title, or use the class deposit.utils.MetadataConverter, where you can access all information directly. This is particularly useful for the OaiRecords, since you do not need to gather the data from those yourself.

With all this information you create the metadata that you deliver together with the PDF file.

If you find that you need metadata not provided by Dissemin, you can ask the user by adjusting the metadata form presented to them. For this you create forms.py, subclass BaseMetadataForm and add more fields.

The PDF file is passed as an argument to the submit_deposit method. It is a path to the PDF file, which you can open with open(pdf, 'rb') for instance.

You also have access to the settings for the target repository, as a Repository object, in self.repository. This should give you all the information you need about how to connect to the repository: self.repository.endpoint, self.repository.username, self.repository.password, and so on (see the documentation of Repository for more details).

If the deposit fails for any reason, you should raise protocol.DepositError with a helpful error message, like this:

raise protocol.DepositError(_('The repository refused your paper.'))

Note that we localize the error (with the underscore function).

It is generally a good idea to log messages to keep track of how the deposit went. You can use the embedded logger, so that your log messages are saved by Dissemin in the relevant DepositRecord, like this:

self.log("Do not forget to log the responses you get from the server.")

Testing the protocol

So now, how do you test this protocol implementation? Instead of testing it manually by yourself, you are encouraged to take advantage of the testing framework available in Dissemin. You will write test cases that check the behaviour of your implementation for particular PDF files and paper metadata.

We currently provide 20 examples of metadata that you can use as fixtures. Additionally, we have fixtures for various repository settings, e.g. Dewey Decimal Class and licenses. You find the data as JSON in test_data. The best way to get started is probably to get familiar with pytest and check out the examples in deposit.sword.tests. Your tests should be a subclass of deposit.tests.test_protocol.MetaTestProtocol, as this defines some tests that every protocol should pass.

Using the protocol

So now you have your shiny new protocol implementation and you want to use it.

First, we need to register the protocol in Dissemin. To do so, add the following lines at the end of deposit/newprotocol/protocol.py:

from deposit.registry import *
protocol_registry.register(NewProtocol)

Next, add your protocol to the enabled apps, by adding deposit.newprotocol in the INSTALLED_APPS list of dissemin/settings/common.py:

...
'deposit',
'deposit.zenodo',
'deposit.newprotocol',
...

Now the final step is to configure a repository using that protocol. Launch Dissemin, go to Django’s web admin, click Repositories and add a new repository, filling in all the configuration details of that repository. The Protocol field should be filled by the name of your protocol, NewProtocol in our case.

Now, when you go to a paper page and try to deposit it, your repository should show up, and if everything went well you should be able to deposit papers.

Each deposit (successful or not) creates a DepositRecord object that you can see from the web admin interface. If you have used the provided log function, the logs of your deposits are available there.

To debug the protocol directly from the site, you can enable Django’s settings.DEBUG (in dissemin/settings.py) so that exceptions raised by your code are popped up to the user.

Adding extra metadata with forms

What if the repository you submit to requires additional metadata that Dissemin does not always provide? We need to add fields to the deposit form to let the user fill this gap.

Fortunately, Django has a very convenient interface to deal with forms, so it should be quite straightforward to add the fields you need.

Let’s say that the repository we want to deposit into takes two additional pieces of information: the topic of the paper (in a set of predefined categories) and an optional comment for the moderators.

All we need to do is to define a form with these two fields:

from django import forms
from django.utils.translation import ugettext_lazy as _

from deposit.forms import BaseMetadataForm

# First, we define the possible topics for a submission
MYREPO_TOPIC_CHOICES = [
    ('quantum epistemology',_('Quantum Epistemology')),
    ('neural petrochemistry',_('Neural Petrochemistry')),
    ('ethnography of predicative turbulence',_('Ethnography of Predicative Turbulence')),
    ('other',_('Other')),
    ]

# Then, we define our metadata form
class NewProtocolForm(BaseMetadataForm):

    # Fields are declared as class arguments
    topic = forms.ChoiceField(
        label=_('Topic'), # the label that will be displayed on the field
        choices=MYREPO_TOPIC_CHOICES, # the possible choices for the user
        required=True, # is this field mandatory?
        # other arguments are possible, see https://docs.djangoproject.com/en/2.2/ref/forms/fields/
        )

    comment = forms.CharField(
         label=_('Comment for the moderators'),
         required=False)

Then, we need to bind this form to our protocol. This looks like:

from deposit.newprotocol.forms import NewProtocolForm

class NewProtocol(RepositoryProtocol):

    # The class of the form for the deposit
    form_class = NewProtocolForm

    def submit_deposit(self, pdf, form):
        pass
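
Inside submit_deposit you can then read the values entered by the user through Django's standard form API, for instance:

def submit_deposit(self, pdf, form):
    # Extra fields declared on NewProtocolForm arrive validated
    # in form.cleaned_data (standard Django forms behaviour).
    topic = form.cleaned_data["topic"]
    comment = form.cleaned_data["comment"]  # "" if the user left it empty
    ...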

Helping Repository Administrators

To help administrators of repositories, you should provide sample data that they can use to test their ingest. For this we once again use pytest. Ideally you have a function generating the metadata. Write a test with the marker write_new_protocol_examples, like in MetaTestSWORDMETSProtocol, and change pytest.ini accordingly.

FAQ & Tips

Here are some frequently asked questions and tips for getting started with working on and contributing to Dissemin. The best way to start hacking on Dissemin is probably to use the VM (see the Vagrant method in Installation).

Fetching a specific paper by DOI

The Dissemin VM is quite empty by default. If you want to inspect a particular paper, you can fetch it by DOI by visiting http://localhost:8080/<DOI>.

Fetching a specific ORCID profile

The Dissemin VM uses the sandbox ORCID API out of the box. Therefore, you cannot fetch a specific profile from ORCID. Here is how to locally fetch a specific profile from ORCID, in order to reproduce and debug bugs in fetching the list of papers for instance.

First, edit the dissemin/settings/__init__.py file to point the ORCID API setting at the real ORCID API, by putting the line ORCID_BASE_DOMAIN = 'orcid.org'.

Then, you should find the ORCID ID you want to fetch locally. You can use the official instance, https://dissem.in/, to search for a given author and get the ORCID ID.

Finally, restart both the Django process as well as the Celery process in the VM and head to http://localhost:8080/<ORCID_ID>, replacing <ORCID_ID> by the full ORCID ID.

Installation

There are two ways to install Dissemin. The automatic way uses Vagrant to install Dissemin in a container or VM and takes only a few commands. The manual way is more complex and is described afterwards.

Vagrant

First, install Vagrant and one of the supported providers: VirtualBox (should work fine), LXC (tested), libvirt (try it and tell us!). Then run the following commands:

  • git clone https://gitlab.com/dissemin/dissemin will clone the repository, i.e., download the source code of Dissemin. You should not reuse an existing copy of the repository, otherwise it may cause errors with Vagrant later.
  • cd dissemin to go in the repository
  • Run vagrant plugin install vagrant-vbguest to install the VirtualBox guest additions plugin for Vagrant
  • Run ./install_vagrant.sh to create the VM / container and provision the machine once
  • vagrant ssh will let you poke into the machine and access its services (PostgreSQL, Redis, ElasticSearch)
  • A tmux session is running so that you can check on the Celery and Django development servers; attach to it using tmux attach. It contains a bash panel, two panels to check on Celery and the Django development server, and a panel to create a superuser (admin account) for Dissemin, which you can then use at localhost:8080/admin.

Dissemin will be available on your host machine at localhost:8080.

Note that, when rebooting the Vagrant VM / container, the Dissemin server will not be started automatically. To start it, once you have booted the machine, run vagrant ssh, then cd /dissemin and ./launch.sh, and wait until it says that Dissemin has started. The same holds for the other backend services; you can check the Vagrantfile and provisioning/provision.sh to find out how to start them.

Manual Installation

This section describes manual installation, if you cannot or do not want to use Vagrant as indicated above. It also serves as installation guide for production. Dissemin is split in two parts:

  • the web frontend, powered by Django;
  • the tasks backend, powered by Celery.

Installing the tasks backend requires additional dependencies and is not necessary if you only want to do light development that does not require harvesting metadata or running author disambiguation. The next subsections describe how to install the frontend; the last one explains how to install the backend, or how to bypass it in case you do not want to install it.

Frontend

Install Packages and Create Virtualenv

First, install the following dependencies (debian packages):

postgresql postgresql-server-dev-all postgresql-client python3-venv build-essential libxml2-dev libxslt1-dev python3-dev gettext libjpeg-dev libffi-dev libmagickwand-dev gdal-bin

Note

On Debian 10+ and Ubuntu 18+, libmagickwand has PDF processing disabled for security reasons. To re-enable it, you have to change the policy to grant at least read access, e.g. with:

sudo sed -i 's/<policy domain="coder" rights="none" pattern="PDF" \/>/<policy domain="coder" rights="read" pattern="PDF" \/>/' /etc/ImageMagick-6/policy.xml

Also make sure pdftk is installed.

Then, build a virtual environment to isolate all the python dependencies:

python3 -m venv .virtualenv
source .virtualenv/bin/activate
pip install --upgrade setuptools
pip install --upgrade pip
pip install -r requirements.txt

In case you want to use the development packages, additionally run:

pip install -r requirements-dev.txt
Database

Choose a unique database name and user name (they can be identical), such as dissemin_myuni. Choose a strong password for your user:

sudo su postgres
psql
CREATE USER dissemin_myuni WITH PASSWORD 'b3a55787b3adc3913c2129205821765d';
ALTER USER dissemin_myuni CREATEDB;
CREATE DATABASE dissemin_myuni WITH OWNER dissemin_myuni;
Search Engine

Dissemin uses the Elasticsearch backend for Haystack. The currently supported version is 2.x.x.

Download Elasticsearch and unzip it:

cd elasticsearch-<version>
./bin/elasticsearch    # Add -d to start elasticsearch in the background

Alternatively you can install the .rpm or .deb package, see the documentation of Elasticsearch for further information.

Make sure to set the initial heap size accordingly.

Backend

Some features in Dissemin rely on an asynchronous task backend, Celery. If you want to simplify your installation and ignore this asynchronous behaviour, you can add CELERY_ALWAYS_EAGER = True to your dissemin/settings/__init__.py. This way, all asynchronous tasks will be run synchronously from the main thread.

Otherwise, you need to run celery in a separate process. The rest of this subsection explains how.

Redis

The backend communicates with the frontend through a message-passing infrastructure. We recommend redis for that (and the source code is configured for it). It also serves as a cache backend (to cache template fragments) and provides locks (to ensure that we do not fetch the publications of a given researcher twice, for instance).

First, install the redis server:

apt-get install redis-server
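
You can check that the server is up with redis-cli:

redis-cli ping    # a running server answers PONG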
Celery

You can run Celery either in the shell or as a daemon. The latter is recommended for production.

Shell

To run the backend (still in the virtualenv):

celery --app=dissemin.celery:app worker -B -l INFO

The -B option starts the scheduler for periodic tasks; the -l option sets the log level to INFO.

Daemon

In production you want to run celery and celerybeat as daemons controlled by systemd. Both are installed in Dissemin's virtual environment, so you have to take care to use this environment. In particular, you should use the same user for Dissemin and Celery.

You can use the following sample files, which are similar to the official ones. The main differences are a different PYTHONPATH, use of the virtual environment, and the stop command for celerybeat. Put this into /etc/default/celery and adjust the CELERY_BIN path:

# See
# http://docs.celeryproject.org/en/latest/userguide/daemonizing.html

CELERY_APP="dissemin.celery:app"
CELERYD_NODES="dissem"
CELERYD_OPTS=""
CELERY_BIN="/path/to/venv/bin/celery"
CELERYD_PID_FILE="/var/run/celery/%n.pid"
CELERYD_LOG_FILE="/var/log/celery/%n.log"
CELERYD_LOG_LEVEL="INFO"

CELERYBEAT_SCHEDULE_FILE="/var/run/celery/beat-schedule"
CELERYBEAT_PID_FILE="/var/run/celery/beat.pid"
CELERYBEAT_LOG_FILE="/var/log/celery/beat.log"

For the celeryd systemd service, put the following in /etc/systemd/system/celery.service and change WorkingDirectory to your Dissemin root:

[Unit]
Description=Celery service
After=network.target

[Service]
Type=forking
User=dissemin
Group=dissemin
Restart=always
EnvironmentFile=-/etc/default/celery
WorkingDirectory=/path/to/dissemin/
ExecStart=/bin/sh -c '${CELERY_BIN} -A ${CELERY_APP} multi start ${CELERYD_NODES} --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL} ${CELERYD_OPTS}'
ExecStop=/bin/sh -c '${CELERY_BIN} multi stopwait ${CELERYD_NODES} --pidfile=${CELERYD_PID_FILE}'
ExecReload=/bin/sh -c '${CELERY_BIN} multi restart ${CELERYD_NODES} -A ${CELERY_APP} --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL} ${CELERYD_OPTS}'

[Install]
WantedBy=multi-user.target

For the celerybeatd systemd service, put the following in /etc/systemd/system/celerybeat.service and change WorkingDirectory to your Dissemin root:

[Unit]
Description=Celerybeat service
After=network.target

[Service]
Type=simple
User=dissemin
Group=dissemin
Restart=always
EnvironmentFile=-/etc/default/celery
WorkingDirectory=/path/to/dissemin/
ExecStart=/bin/sh -c 'PYTHONPATH=$(pwd) ${CELERY_BIN} -A ${CELERY_APP} beat --pidfile=${CELERYBEAT_PID_FILE} --logfile=${CELERYBEAT_LOG_FILE} --loglevel=${CELERYD_LOG_LEVEL} -s ${CELERYBEAT_SCHEDULE_FILE}'
ExecStop=/bin/kill -s TERM $MAINPID

[Install]
WantedBy=multi-user.target

Note that we use /bin/sh -c so that PYTHONPATH and ${CELERY_BIN} are expanded properly.

To make systemd create the necessary directories with the right permissions, put the following into /etc/tmpfiles.d/celery.conf:

d /var/run/celery 0755 dissemin dissemin
d /var/log/celery 0755 dissemin dissemin

After that, run systemctl daemon-reload to reload the systemd service files; celery and celerybeat are then ready to be controlled with systemd:

systemctl start celery.service
systemctl start celerybeat.service

To make them start on boot, call:

systemctl enable celery.service
systemctl enable celerybeat.service
Logrotate

Over time Celery's log files tend to get rather big, so you should enable log rotation. Celery does not complain if a log file is removed; it just opens it again.
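
A minimal sketch of such a policy, assuming the log locations from /etc/default/celery above (frequency and retention are arbitrary choices); put it into /etc/logrotate.d/celery:

/var/log/celery/*.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}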

Configuration

Configure the Application for Development or Production

Finally, create a file dissemin/settings/__init__.py with this content:

# Development settings
from .dev import *
# Production settings.
from .prod import *
# Pick only one.

For most of the settings we refer to the Django documentation.

Logs

Dissemin comes with a predefined logging system. You can change the settings in dissemin/settings/common.py and change the default log level for production and development in the corresponding files. When using Dissemin from the shell with ./manage shell, you can set the log level for console output as an environment variable:

export DISSEMIN_LOGLEVEL='YOUR_LOG_LEVEL'

In production, make sure that Apache collects all your log messages. Alternatively, you can send them to a separate file by changing the log settings.

Sentry

Dissemin uses Sentry to monitor severe errors. To enable Sentry, set the SENTRY_DSN setting.
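
For example, with a placeholder DSN (use the one shown in your Sentry project):

# dissemin/settings/__init__.py
SENTRY_DSN = 'https://examplePublicKey@o0.ingest.sentry.io/0'  # placeholder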

ORCID

You can either use production ORCID or its sandbox. The main difference is the registration process.

You are not forced to configure ORCID to work on Dissemin; you can just create a superuser and use that!

Production

In your ORCID account, go to Developer Tools and register an API key. As the redirection URL, give the URL of your installation.

Set ORCID_BASE_DOMAIN to orcid.org in the Dissemin settings.
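
That is:

# dissemin/settings/__init__.py
ORCID_BASE_DOMAIN = 'orcid.org'  # 'sandbox.orcid.org' for the sandbox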

In the admin interface, go to Social Authentication, set the provider to orcid.org and enter the required data.

Now you can authenticate with ORCID.

Sandbox

Create an account on Sandbox ORCID.

Go to Developer Tools and verify your e-mail address using Mailinator; you must not choose a different provider.

Set the redirection URI to localhost:8080 (which is supposed to be where your Dissemin instance is running).

Now proceed as in Production, but with sandbox.orcid.org.

Repositories

By default, Dissemin is not configured to let users deposit in any repository. To enable this feature, you need to add repositories by visiting the Django admin interface at /admin/deposit/repository/ and creating a new repository there. Depending on the repository, you will need to supply various data fields.

In all cases you need to provide a name for the repository, a short description which will be shown to users, and a logo.

HAL

To set up the connection with HAL, use the following settings:

Zenodo

To set up the connection with Zenodo, use the following settings:

Shibboleth

Shibboleth is a SAML-based authentication mechanism widely used in academic research. CAPSH has joined the French federation RENATER in order to provide a login with eduGAIN. In the SAML world there is usually an Identity Provider (IdP) that performs (local) authentication and a Service Provider (SP) that offers some kind of service. In this case, https://dissem.in/ is the SP.

Relevant documentation can be found at Shibboleth and RENATER. It covers how Shibboleth works, as well as instructions on participating in the federation and registering an SP.

The entityID of our production service is https://sp.dissem.in/shibboleth, while we use https://sp.sandbox.dissem.in/shibboleth for our sandbox. In the following, this guide assumes that production and sandbox run on the same machine.

Installation

Shibboleth requires the Apache module (mod_shib) and the shibd daemon. Official packages are available for RedHat and openSUSE. For Ubuntu and Debian based systems, please follow the guide from SWITCHaai.

The certificates and keys for signing and encryption might be missing. Self-signed certificates are fine. To generate them, run:

openssl req -config config-cert.conf -new -x509 -nodes -days 1095 -keyout sp-encrypt-key.pem -out sp-encrypt-cert.pem
openssl req -config config-cert.conf -new -x509 -nodes -days 1095 -keyout sp-signing-key.pem -out sp-signing-cert.pem

where the config file looks like:

[ req ]
default_bits = 4096
distinguished_name = req_distinguished_name
prompt = no
x509_extensions = req_ext

[ req_distinguished_name ]
C = FR
O = CAPSH
CN = dissem.in
emailAddress = team@dissem.in

[ req_ext ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = dissem.in
DNS.2 = sandbox.dissem.in
DNS.3 = https://sp.dissem.in/shibboleth # entityID production
DNS.4 = https://sp.sandbox.dissem.in/shibboleth # entityID sandbox

Warning

When the certificates expire and we have to renew them, we must communicate this to RENATER! For a short period of time we have to provide both certificates, the old and the new one, so that the IdPs can update to the new one and the transition is seamless.

Note

In theory, we could use the same certificate as for the HTTPS server, but this is disadvantageous with Let's Encrypt, since every new certificate would require changing our Shibboleth metadata.

shibboleth2.xml

This is the central configuration file for Shibboleth, where the magic happens. After changing the configuration, touch the file to tell the Shibboleth daemon to reload it; this does not disturb the service. Depending on the changes, the metadata for our entityID may change.

Since RENATER offers a production as well as a test federation, we need to provide different metadata. This is done via an ApplicationOverride, since only a few values differ and must be set explicitly (see the sketch after this list):

  • entityID
  • discoveryURL
  • MetadataProvider
  • MetadataGenerator

You can find our (sample) shibboleth2.xml as well as our attribute-map.xml in our GitLab repository; check the provisioning folder.

Also make sure that the settings comply with SAML Metadata Published by RENATER.

Apache

In order to make Shibboleth available on the virtual host, add:

<Location /Shibboleth.sso>
    SetHandler shib
</Location>

This way, Shibboleth gets precedence over WSGI for /Shibboleth.sso. In theory, you could use any other alias, but this one is somewhat of a standard.

For our sandbox, make sure to add:

<Location />
    ShibRequestSetting applicationId sandbox
    AuthType shibboleth
    Require shibboleth
</Location>

right before the WSGI part. This makes sure that the ApplicationOverride for the sandbox mentioned above is used.

Django

In Django, only a few things need to be configured. You need to set SHIB_DS_SP_URL, the URL of the daemon's login endpoint, which performs a redirect to the chosen IdP; in production this is https://dissem.in/Shibboleth.sso/Login. Then you have to point to the DiscoFeed, either as a URL or as a file; usually the URL is fine, and in production it is https://dissem.in/Shibboleth.sso/DiscoFeed.

Note

In the development settings, both are predefined and there is no need to change them. However, authentication will not be possible, because the values are somewhat made up.

Then you need to set SHIBBOLETH_LOGOUT_URL in the settings. This points to the daemon's logout endpoint, e.g. https://dissem.in/Shibboleth.sso/Logout, and logs the user out of the daemon. The user is eventually redirected to Dissemin's start page.
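
Collected in one place, using the production values quoted above (the name of the setting pointing to the DiscoFeed is omitted here; check the settings shipped with Dissemin):

# dissemin/settings/__init__.py -- production values from this section
SHIB_DS_SP_URL = 'https://dissem.in/Shibboleth.sso/Login'
SHIBBOLETH_LOGOUT_URL = 'https://dissem.in/Shibboleth.sso/Logout'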

In a development environment you should not set SHIBBOLETH_LOGOUT_URL.

Troubleshooting

Systemd timeout

Under certain circumstances shibd takes a long time to start. This is because we process the whole eduGAIN IdP metadata; the crucial time killer is the validation of signatures.

Usually this is only an issue when starting shibd for the first time, since cached IdPs won’t be validated again.

There are three ways to solve this:

  1. Increase timeout on systemd for shibd
  2. Stop shibd and initialize it manually
  3. Turn off validation.

Of course, 3. is not an option!

The standard approach to this problem is MDQ, where IdP metadata is fetched on demand. This system is not (yet) suitable for a discovery service, since the discovery service needs to know all IdPs.

Missing attributes

Although we declared the attributes we require within eduGAIN, this does not mean that an IdP will release them. It is up to the IdP which attributes it releases to an SP. Usually they will ship eduPersonTargetedID, surname and givenName. If necessary, we can ask the IdP to release more attributes.

Deploying Dissemin

You have two options to run the web server: development or production settings.

Development

Simply run ./launch.sh. This uses the default Django server and serves the website locally on port 8080. Note that the standard port of Django's runserver command is 8000, but using 8080 ensures compatibility with the Vagrant installation.

This runs with DEBUG = True, which means that Django will report to the user any internal error in a transparent way. This is useful to debug your installation but should not be used for production as it exposes your internal settings.

Whenever the database layout changes, run:

./manage migrate

Production

Like any Django website, Dissemin can be served by various web servers. These settings are not specific to Dissemin itself, so you should refer to the relevant Django documentation.

There are some steps that you always have to perform when deploying (which includes rolling out updates). You should keep this order, and make sure the virtual environment is activated. A script collecting these steps is sketched after the list.

  1. Upgrade requirements with pip install --upgrade -r requirements.txt
  2. Apply migrations with ./manage.py migrate
  3. Compile scss files with ./manage.py compilescss
  4. Collect static files with ./manage.py collectstatic --ignore=*.scss
  5. Compile translations with ./manage.py compilemessages --exclude qqq
  6. Tell WSGI to reload with touch dissemin/wsgi.py
  7. Restart celery with systemctl

Make sure that your media/ directory is writable by the user under which the application will run (www-data on Debian).

Self-hosting MathJax

Dissemin requires MathJax for rendering LaTeX formatting in the abstracts. Out of the box, Dissemin will use a CDN-hosted version of MathJax.

An easy solution is to self-host MathJax. You can follow the installation instructions from MathJax to get a local copy. Ideally, you should put it in the static directory (under /home/dissemin/www/static/ in the example below).

Note that MathJax consists of many small files, which can slow down the built-in Django web server considerably. Hence, it is better to serve it directly with Apache and avoid having all these files in the papers/static/libs directory of Dissemin.

Once MathJax is downloaded and reachable through your web server, you can use the setting MATHJAX_SELFHOST_URL (in dissemin/settings) to specify the location to load MathJax from. In the example below, this would be //dissemin.myuni.edu/static/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML.
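
That is, with the example host above:

# dissemin/settings/__init__.py
MATHJAX_SELFHOST_URL = '//dissemin.myuni.edu/static/mathjax/MathJax.js?config=TeX-AMS-MML_HTMLorMML'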

Apache with WSGI

A sample VirtualHost, assuming that the root of the Dissemin source code is at /home/dissemin/prod and that you use a Python 3.7 virtualenv, is available in the Dissemin Git repository.

You should only have to change the path to the application and the domain name of the service.

Administration of Dissemin

Configuring Institutions

Configuring institutions is rarely necessary, because they are maintained by the backend, i.e. the institutions are created, merged and deleted automatically.

Identifiers

Each institution has a list of identifiers. The identifiers with the prefix shib: are usually added manually. We use them to match a Shibboleth-authenticated user with an institution without saving the user's institution in the database. The identifier itself is usually the IdP of the institution, extracted from eduPersonTargetedID, which has the format IdP!SP!user_identifier.
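
A hypothetical example with made-up IdP and user values:

# eduPersonTargetedID as released by the IdP (IdP!SP!user_identifier)
https://idp.myuni.edu/idp/shibboleth!https://sp.dissem.in/shibboleth!a1b2c3d4
# matching identifier stored on the institution
shib:https://idp.myuni.edu/idp/shibboleth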

Repositories

If a repository is associated with an institution, it is preselected on deposit when the user is related to that institution.

Configuring Repositories

Configuring a repository involves not just the repository itself, but also some related settings.

Repositories

On the admin site in the section Deposit you find your repositories. You can add and modify them.

You have the following options:

Name
The name of the repository.
Description
The description of the repository. This is shown to the user. You cannot use any markup.
URL
URL of the repository, ex: https://arXiv.org/
Logo
A logo of your repository or its hosting institution. This is shown to the user.
Protocol
You can choose a protocol to use for the deposit.
OAISource
The source with which the OaiRecords associated with the deposits are created.
Username and Password
If your repository uses password authentication, fill in these values.
Api_key
If your repository uses an API key, fill it in here.
Endpoint
URL to call when depositing.
Update Status URL
If not empty, the SWORD protocol refreshes the deposit status from this URL. The id of the repository entry is inserted at {}, so the URL must contain {}.
Enabled
Here you can enable or disable the repository. Disabling means that the repository refuses deposits and is not shown to the user.
Abstract required
Defines whether the user must enter an abstract. Usually abstracts can be fetched from Zotero. Default is: Yes.
Embargo
Defines whether an embargo is required, optional or not used by the repository.
Green open access service
Set a custom text shown to the user after depositing in this repository. Leave this empty if you do not want a message shown to the user. See Green Open Access Service.
DDC
Here you can choose some DDC classes for the repository. If no DDC is selected, the user won't be bothered at all. See also: DDC.
License chooser

Here you can add a list of licenses that your repository accepts. There is no limit, but you should keep your selection moderate. There are the following options:

Transmit id
Value to identify the license in the metadata.
Position
The position at which the corresponding license is shown to the user. The behaviour is the same as for Python lists. The position is per repository. However, using -1 can lead to a higher number; this is kind of a bug, but does no harm. Also, some prepopulated values might not start at 0.
Default
Check this if you want to use the license as the default; it is then preselected in the list on the deposit page. If you check none, the first license (in alphabetical order) is used. If you check more than one, the first of the checked licenses (in alphabetical order) is used. Both cases lead to a warning.

See also: Licenses

Note

Our implementation of the HAL protocol does not use licenses or DDC. Our implementation of the Zenodo protocol does not use DDC.

DDC

On this site you can define DDC classes. You can choose any class from 000 to 999; leading zeros are added automatically when calling __str__ on the object. Set a parent to make a group. The parent can be any of the classes 100*i for i in range(10), i.e. 000, 100, ..., 900. This groups the DDC classes when they are displayed to the user.

Note

You can localize your DDC names; see Localization for further information.

Green Open Access Service

Here you can set the text of an alert shown after the user deposits into a repository. The GOAS object requires:

  • heading - Heading, e.g. the name of the service.
  • text - A short text displayed to the user.
  • learn_more_url - URL to the webpage with more information about this service.

Note

You can localize your Green Open Access Service alerts; see Localization for further information.

See also Green Open Access Service.

Licenses

On the admin site in the section Deposit you find the licenses. You can add and modify them.

Each license consists of its name and its URI. If your license does not provide a URI, you can use the namespace https://dissem.in/deposit/license/.

Note

You can localize your licenses' names; see Localization for further information.

Institution

To match a repository with an institution, see Configuring Institutions.

Creating a Letter of Declaration

The letter of declaration is a sensitive document since it has a legal character.

To maintain this legal character, Dissemin ships the letter of declaration exactly as it is designed by the repository administration.

There are three ways to handle this:

  1. Serve the user the letter and let them fill in everything
  2. Fill in the letter with forms
  3. Set the letter in python using reportlab

The second way requires the least effort and easily preserves the corporate design.

First, inspect the PDF form with pdftk, using pdftk <pdf> dump_data_fields > fields.txt. In fields.txt you can then see the form fields and their names. Identify the names and the values to fill in.
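
The output looks roughly like the following excerpt; the field names here are hypothetical and depend entirely on your PDF:

---
FieldType: Text
FieldName: author_name
FieldFlags: 0
FieldJustification: Left
---
FieldType: Text
FieldName: paper_title
FieldFlags: 0
FieldJustification: Left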

Next, place the file with a meaningful name in deposit/declarations/pdf_templates/.

Now, some things need to be coded in Python. In deposit/declaration.py, add a new function. Let it create a list of (field name, value) pairs with the necessary values, and pass it together with the path to the file to the function fill_forms. By default, all form fields will be replaced with plain text; if you want to keep the forms, pass flatten=False as an additional parameter. The return value of fill_forms is an io.BytesIO that you just return.

In order to make your new function available to the repository, add the function with a meaningful key to REGISTERED_DECLARATION_FUNCTIONS.
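
The following is only a hypothetical sketch; the exact arguments your function receives, the argument order of fill_forms and the data structure behind REGISTERED_DECLARATION_FUNCTIONS depend on the actual code base:

# deposit/declaration.py -- hypothetical sketch
def declaration_myuni(deposit_record):
    # field names as found in fields.txt, values taken from the deposit
    fields = [
        ('author_name', 'Max Mustermann'),
        ('paper_title', 'A modern view on ...'),
    ]
    # pass flatten=False additionally to keep the form fields editable
    return fill_forms('deposit/declarations/pdf_templates/myuni.pdf', fields)

# register the function under a meaningful key
REGISTERED_DECLARATION_FUNCTIONS['myuni'] = declaration_myuni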

In the admin section you can then add a new letter of declaration.

Here you can set the text of the alert shown to the user after depositing into the repository. The object requires:

  • heading - Heading, e.g. ‘Contract required!’.
  • text - A short text displayed to the user.
  • url_text - Text of the download button.
  • url - The URL to an online form.
  • function key - The function that generates the letter.

Note

You can localize your letter of declaration alerts; see Localization for further information.

See also Letter of Declaration.

After this is done, you can choose a letter of declaration object for your repository and it will be displayed!