Welcome to Dipper’s documentation!

Dipper is a Python package that generates RDF triples from common scientific resources. Dipper includes subpackages and modules to create graph models of these data, including:

  • Models package for generating common sets of triples, including common OWL axioms, complex genotypes, associations, evidence and provenance models.
  • Graph package for building graphs with RDFLib or streaming N-Triples
  • Source package containing fetchers and parsers that interface with remote databases and web services

Getting started

Installing, running, and the basics

Installation

Dipper requires Python version 3.5 or higher.

Install with pip:

pip install dipper

Development version

The development version can be pulled from GitHub.

pip3 install git+git://github.com/monarch-initiative/dipper.git

Building locally

git clone https://github.com/monarch-initiative/dipper.git
cd dipper
pip install .

Alternatively, a subset of source-specific requirements may be installed. To install the core requirements:

pip install -r requirements.txt

To install source-specific requirements, use the requirements/ directory, for example:

pip install -r requirements/mgi.txt

To install requirements for all sources:

pip install -r requirements/all-sources.txt

Getting started with Dipper

This guide assumes you have already installed dipper. If not, then follow the steps in the Installation section.

Command line

You can run the code by supplying a list of one or more sources on the command line. Some examples:

dipper-etl.py --sources omim,ncbigene

Furthermore, you can check things out by supplying a limit. This will only process the first N rows or data elements:

dipper-etl.py --sources hpoa --limit 100

Other command line parameters are explained if you request help:

dipper-etl.py --help

Notebooks

We provide Jupyter Notebooks to illustrate the functionality of the python library. These can also be used interactively.

See the Notebooks section for more details.

Building models

This code example shows some of the basics of building RDF graphs using the models API:

import pandas as pd
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model

columns = ['variant', 'variant_label', 'variant_type',
           'phenotype','relation', 'source', 'evidence', 'dbxref']

data =  [
     ['ClinVarVariant:254143', 'C326F', 'SO:0000694',
      'HP:0000504','RO:0002200', 'PMID:12503095', 'ECO:0000220',
      'dbSNP:886037891']
]

# Initialize graph and model
graph = RDFGraph()
model = Model(graph)

# Read file
dataframe = pd.DataFrame(data=data, columns=columns)

for index, row in dataframe.iterrows():
    # Add the triple ClinVarVariant:254143 RO:0002200 HP:0000504
    # RO:0002200 is the has_phenotype relation
    # HP:0000504 is the phenotype term from the data above
    model.addTriple(row['variant'], row['relation'], row['phenotype'])

    # The addLabel method adds a label using the rdfs:label relation
    model.addLabel(row['variant'], row['variant_label'])

    # addType makes the variant an individual of a class,
    # in this case SO:0000694 'SNP'
    model.addType(row['variant'], row['variant_type'])

    # addXref uses the relation OIO:hasDbXref
    model.addXref(row['variant'], row['dbxref'])

# Serialize the graph as turtle
print(graph.serialize(format='turtle').decode("utf-8"))

For more information see the Working with the models API section.

Notebooks

Jupyter notebook examples

We use Jupyter Notebooks to illustrate the functionality of the dipper library and to provide interactive examples.

Available tutorials include:

Running jupyter locally

Follow the instructions for installing from GitHub in Installation. Then start a notebook browser with:

pip install jupyter
PYTHONPATH=. jupyter notebook ./docs/notebooks

Downloads

RDF

The dipper output is quality checked and released on a regular basis. The latest release can be found here:

The output from our development branch is made available here (may contain errors):

TSV

TSV downloads for common queries can be found here:

Neo4J

A dump of our Neo4J database that includes the output from dipper plus various ontologies:

A public version can be accessed via the SciGraph REST API:

Ingest status

We use Jenkins to periodically build each source. A dashboard containing the current status of each ingest can be found here:

Applications

Monarch Initiative

The Monarch application is powered in part by Dipper:

Owlsim

Annotations loaded into Owlsim are from the Dipper/SciGraph pipeline:

Deeper into Dipper

A look into the structure of the codebase and how to write ingests

Working with graphs

The Dipper graph package provides two graph implementations: an RDFGraph, which is an extension of the RDFLib [1] Graph [2], and a StreamedGraph, which prints triples to standard out in the N-Triples format.

RDFGraphs

The RDFGraph class reads the curie_map.yaml file and converts strings formatted as CURIEs to RDFLib URIRefs. Triples are added via the addTriple method, for example:

from dipper.graph.RDFGraph import RDFGraph

graph = RDFGraph()
graph.addTriple('foaf:John', 'foaf:knows', 'foaf:Joseph')

The graph can then be serialized in a variety of formats using RDFLib [3]:

from dipper.graph.RDFGraph import RDFGraph

graph = RDFGraph()
graph.addTriple('foaf:John', 'foaf:knows', 'foaf:Joseph')
print(graph.serialize(format='turtle').decode("utf-8"))

# Or write to file
graph.serialize(destination="/path/to/output.ttl", format='turtle')

Prints:

@prefix OBO: <http://purl.obolibrary.org/obo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

foaf:John foaf:knows foaf:Joseph .

When an object is a literal, set the object_is_literal parameter to True:

from dipper.graph.RDFGraph import RDFGraph

graph = RDFGraph()
graph.addTriple('foaf:John', 'rdfs:label', 'John', object_is_literal=True)

Literal types can also be passed into the method:

from dipper.graph.RDFGraph import RDFGraph

graph = RDFGraph()
graph.addTriple(
    'foaf:John', 'foaf:age', 12,
    object_is_literal=True, literal_type="xsd:integer"
)

StreamedGraphs

StreamedGraphs print triples as they are processed by the addTriple method. This is useful for large sources where building the full graph in memory is impractical. The output should be sorted and uniquified downstream, since there is no checking for duplicate triples. For example:

from dipper.graph.StreamedGraph import StreamedGraph

graph = StreamedGraph()
graph.addTriple('foaf:John', 'foaf:knows', 'foaf:Joseph')

Prints:

<http://xmlns.com/foaf/0.1/John> <http://xmlns.com/foaf/0.1/knows> <http://xmlns.com/foaf/0.1/Joseph> .

Working with the models API

The models package provides classes for building common sets of triples based on our modeling patterns.

For an example see the notebook on this topic: Building graphs with the model API

Basics

The models class provides methods for building common RDF and OWL statements

For a list of methods, see the API docs.
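
For example, a minimal sketch (the identifiers below are illustrative) that declares a class, adds a synonym, and declares an individual of that class using the Model methods listed in the API docs:

from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model

graph = RDFGraph()
model = Model(graph)

# Declare a class with an rdfs:label
model.addClassToGraph('OMIM:154700', 'Marfan syndrome')

# Add an exact synonym (the default synonym type)
model.addSynonym('OMIM:154700', 'MFS')

# Declare an individual and type it with SO:0000694 'SNP'
model.addIndividualToGraph('ClinVarVariant:254143', 'C326F', 'SO:0000694')

print(graph.serialize(format='turtle').decode("utf-8"))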

Building associations

We use the RDF reification [1] pattern to create ternary statements, for example, adding frequency data to phenotype-to-disease associations. We utilize the Open Biomedical Association ontology [2] to reify statements, and the SEPIO ontology to add evidence and provenance.

For a list of classes and methods, see the API docs.
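
As a rough sketch (the identifiers are illustrative, and 'hpoa' merely stands in for the defining resource), a disease-to-phenotype association can be reified with the D2PAssoc class and decorated with evidence and a source:

from dipper.graph.RDFGraph import RDFGraph
from dipper.models.assoc.D2PAssoc import D2PAssoc

graph = RDFGraph()

# D2PAssoc defaults to the has_phenotype relation when none is supplied
assoc = D2PAssoc(graph, 'hpoa', 'OMIM:154700', 'HP:0001166')
assoc.add_evidence('ECO:0000304')    # an ECO evidence code
assoc.add_source('PMID:12503095')    # a supporting publication
assoc.add_association_to_graph()

print(graph.serialize(format='turtle').decode("utf-8"))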

Building genotypes

We use the GENO ontology [4] to build complex genotypes and their parts.

For a list of methods, see the API docs.

GENO docs: The Genotype Ontology (GENO)
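
As a sketch (the gene, allele, and genotype identifiers below are illustrative), the Genotype class can be used to assemble a genotype from its parts:

from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Genotype import Genotype

graph = RDFGraph()
geno = Genotype(graph)

# Declare a gene and an allele (variant locus), then relate them
geno.addGene('NCBIGene:2200', 'FBN1')
geno.addAllele('ClinVarVariant:254143', 'FBN1<C326F>')
geno.addAlleleOfGene('ClinVarVariant:254143', 'NCBIGene:2200')

# Create a genotype, attach the allele as a part, and set the taxon
geno.addGenotype(':genotype-example', 'FBN1<C326F>/+')
geno.addParts('ClinVarVariant:254143', ':genotype-example')
geno.addTaxon('NCBITaxon:9606', ':genotype-example')

print(graph.serialize(format='turtle').decode("utf-8"))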

Building complex evidence and provenance graphs

We use the SEPIO ontology to build complex evidence and provenance. For an example see the IMPC source ingest.

For a list of methods, see the API docs for evidence and provenance.

SEPIO docs: The Scientific Evidence and Provenance Information Ontology (SEPIO)
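
A minimal sketch (the association, evidence line, and agent identifiers are illustrative) of attaching evidence and provenance to an association with the Evidence and Provenance classes:

from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Evidence import Evidence
from dipper.models.Provenance import Provenance

graph = RDFGraph()

association_id = ':association-example'
evidence_line = ':evidenceline-example'

# Attach a line of evidence with a supporting publication to the association
evidence = Evidence(graph, association_id)
evidence.add_supporting_evidence(evidence_line, 'ECO:0000220')
evidence.add_supporting_publication(evidence_line, 'PMID:12503095')

# Record the agent that made the assertion
provenance = Provenance(graph)
provenance.add_agent_to_graph(':agent-example', 'Example curation team')
provenance.add_assertion(association_id, ':agent-example', 'Example curation team')

print(graph.serialize(format='turtle').decode("utf-8"))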

Writing ingests with the source API

Overview

Although not required to write an ingest, we have provided a source parent class that can be extended to leverage reusable functionality in each ingest.

To create a new ingest using this method, first extend the Source class.

If the source contains flat files, include a files dictionary with this structure:

files = {
    'somekey': {
        'file': 'filename.tsv',
        'url': 'http://example.org/filename.tsv'
    },
    ...
}

For example:

from dipper.sources.Source import Source


class TPO(Source):
    """
    The ToxicoPhenomicOmicsDB contains data on ...
    """

    files = {
        'genes': {
            'file': 'genes.tsv',
            'url': 'http://example.org/genes.tsv'
        }
    }

Initializing the class

Each source class takes graph_type (string) and are_bnodes_skolemized (boolean) parameters. These parameters are used to initialize a graph object in the Source constructor.

Note: In the future this may be adjusted so that a graph object is passed into each source.

For example:

def __init__(self, graph_type, are_bnodes_skolemized):
    super().__init__(graph_type, are_bnodes_skolemized, 'TPO')

Writing the fetcher

This method is intended to fetch data from the remote locations (if it is newer than the local copy).

Extend the parent fetch function. If the remote file has already been downloaded, the fetch method checks the remote headers to see if it has been updated since the local copy was retrieved. For sources not served over HTTP, this method may need to be overridden, for example in Bgee.

For example:

def fetch(self, is_dl_forced=False):
    """
    Fetches files from TPO

    :param is_dl_forced (bool): Force download
    :return None
    """
    self.get_files(is_dl_forced)

Writing the parser

Typically these are written by looping through the series of files that were obtained by the fetch method. The goal is to process each file minimally, adding classes and individuals as necessary, and adding triples to the source's graph.

For example:

def parse(self, limit=None):
    """
    Parses genes from TPO

    :param limit (int, optional) limit the number of rows processed
    :return None
    """
    if limit is not None:
        logger.info("Only parsing first %d rows", limit)

    # Open file
    fh = open('/'.join((self.rawdir, self.files['genes']['file'])), 'r')
    # Parse file
    self._add_gene_toxicology(fh, limit)
    # Close file
    fh.close()
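
The _add_gene_toxicology helper above is not part of the dipper API; a minimal, hypothetical sketch (assuming tab-separated columns of gene id, gene label, and phenotype id, and that Model has been imported from dipper.models.Model) could look like:

def _add_gene_toxicology(self, fh, limit=None):
    """
    Hypothetical row handler for the genes file

    :param fh: open file handle for the raw file
    :param limit (int, optional): limit the number of rows processed
    :return None
    """
    model = Model(self.graph)
    for line_num, line in enumerate(fh, start=1):
        if limit is not None and line_num > limit:
            break
        # Assumed columns: gene id, gene label, phenotype id
        gene_id, gene_label, phenotype_id = line.rstrip('\n').split('\t')
        # Genes are added as (OWL) classes
        model.addClassToGraph(gene_id, gene_label)
        # RO:0002200 is the has_phenotype relation
        self.graph.addTriple(gene_id, 'RO:0002200', phenotype_id)
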
Considerations when writing a parser

There are certain conventions that we follow when parsing data:

1. Genes are a special case of genomic feature and are added as (OWL) classes, while all other genomic features are added as individuals of an OWL class (see the sketch after this list).

2. If a source references an external identifier, assume that it has been processed in another source script, and add only the identifier (not the label) within this source's ingest. This helps prevent label collisions caused by slightly different versions of the source data when integrating downstream.

3. You can instantiate a class or individual as many times as you want; they will get merged in the graph and will only show up once in the resulting output.
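
For example, a minimal sketch of points 1 and 2 (the identifiers are illustrative):

from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model

graph = RDFGraph()
model = Model(graph)

# A gene from this source is added as a class, with its label
model.addClassToGraph('NCBIGene:2200', 'FBN1')

# A gene referenced from another source gets its identifier only,
# with no label, to avoid label collisions downstream
model.addClassToGraph('ENSEMBL:ENSG00000166147')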

Testing ingests

Unit tests

Unit style tests can be achieved by mocking source classes (or specific functions) and testing single functions. The test_graph_equality function can be used to test graph equality by supplying a string formatted as headless (no prefixes) turtle and a graph object. Most dipper methods are not pure functions and rely on side effects to a graph object, so it is best to clear the graph object before any testing logic, e.g.:

from dipper.graph.RDFGraph import RDFGraph
from dipper.utils.TestUtils import TestUtils

source.graph = RDFGraph(True)  # Reset graph
test_util = TestUtils()
source.run_some_function()
expected_triples = """
    foaf:person1 foaf:knows foaf:person2 .
"""
self.assertTrue(test_util.test_graph_equality(
    expected_triples, source.graph))

Integration tests

Integration tests can be executed by generating a file that contains a subset of a source’s data in the same format, and running it through the source.parse() method, serializing the graph, and then testing this file in some other piece of code or database.
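
For example, a rough sketch (assuming the hypothetical TPO source from the Overview above, an 'rdf_graph' graph type string, and a hand-made subset file already placed in the raw directory in the same format as the real source file):

from dipper.sources.TPO import TPO  # hypothetical source from the Overview example

source = TPO('rdf_graph', True)
# Parse the subset file that was placed in the raw directory beforehand
source.parse(limit=50)
# Serialize the resulting graph so it can be checked by other code
# or loaded into a triplestore
source.graph.serialize(destination='/tmp/tpo_test.ttl', format='turtle')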

You may see testing code within source classes, but these tests will be deleted or refactored and moved to the test directory.

Configuring dipper with keys and passwords

Add private configuration parameters into your private conf.json file. Examples of items to put into the config include:

  • database connection parameters (in the “dbauth” object)
  • ftp login credentials
  • api keys (in the “keys” object)

These are organized such that within any object (dbauth, keys, etc), they are keyed again by the source’s name.

Here is an example:

{
  "keys": {
    "omim": "foo"
  },
  "dbauth": {
    "mgi": {
      "user": "bar",
      "password": "baz"
    }
  }
}

This file must be placed in the dipper package directory and named conf.json. If building locally this is the dipper/dipper/ directory; if installed with pip it will be the path/to/env/lib/python3.x/site-packages/dipper/ directory.

Schemas

Although RDF is inherently schemaless, we aim to construct consistent models across sources. This allows us to build source agnostic queries and bridge data across sources.

The dipper schemas are documented as directed graphs. Examples can be found in the ingest artifacts repo.

Some ontologies contain documentation on how to describe data using the classes and properties defined in the ontology:

While not yet implemented, in the future we plan on defining our schemas and constraints using the BioLink model specification.

The cypher queries that we use to cache inferred and direct relationships between entities are stored in GitHub.

For developers

API Docs

dipper package

Subpackages
dipper.graph package
Submodules
dipper.graph.Graph module
class dipper.graph.Graph.Graph

Bases: object

addTriple(subject_id, predicate_id, object_id, object_is_literal, literal_type)
serialize(subject_iri, predicate_iri, obj, object_is_literal, literal_type)
skolemizeBlankNode(curie)
dipper.graph.RDFGraph module
class dipper.graph.RDFGraph.RDFGraph(are_bnodes_skized=True, identifier=None)

Bases: rdflib.graph.ConjunctiveGraph, dipper.graph.Graph.Graph

Extends RDFLib's ConjunctiveGraph. The goal of this class is to wrap the creation of triples and manage creation of URIRefs, BNodes, and Literals from an input CURIE.

addTriple(subject_id, predicate_id, obj, object_is_literal=False, literal_type=None)
bind_all_namespaces()
curie_util = <dipper.utils.CurieUtil.CurieUtil object>
skolemizeBlankNode(curie)
dipper.graph.StreamedGraph module
class dipper.graph.StreamedGraph.StreamedGraph(are_bnodes_skized=True, file_handle=None, fmt='nt')

Bases: dipper.graph.Graph.Graph

Stream RDF triples to file or stdout. Assumes a downstream process will sort then uniquify triples.

Theoretically could support both ntriple and rdfxml formats; for now just supports nt.

addTriple(subject_id, predicate_id, object_id, object_is_literal=False, literal_type=None)
curie_util = <dipper.utils.CurieUtil.CurieUtil object>
serialize(subject_iri, predicate_iri, obj, object_is_literal=False, literal_type=None)
skolemizeBlankNode(curie)
dipper.models package
Subpackages
dipper.models.assoc package
Submodules
dipper.models.assoc.Association module
class dipper.models.assoc.Association.Assoc(graph, definedby, sub=None, obj=None, pred=None)

Bases: object

A base class for OBAN (Monarch)-style associations, to enable attribution of source and evidence on statements.

add_association_to_graph()
add_date(date)
add_evidence(identifier)

Add an evidence code to the association object (maintained as a list) :param identifier:

Returns:
add_predicate_object(predicate, object_node, object_type=None, datatype=None)
add_provenance(identifier)
add_source(identifier)

Add a source identifier (such as publication id) to the association object (maintained as a list) TODO we need to greatly expand this function!

Parameters:identifier
Returns:
annotation_properties = {'consider': 'OIO:consider', 'definition': 'IAO:0000115', 'hasExactSynonym': 'OIO:hasExactSynonym', 'hasRelatedSynonym': 'OIO:hasRelatedSynonym', 'has_xref': 'OIO:hasDbXref', 'inchi_key': 'CHEBI:InChIKey', 'probabalistic_quantifier': 'GENO:0000867', 'replaced_by': 'IAO:0100001'}
assoc_types = {'association': 'OBAN:association'}
datatype_properties = {'created_on': 'pav:createdOn', 'has_measurement': 'IAO:0000004', 'has_quantifier': 'GENO:0000866', 'position': 'faldo:position'}
get_association_id()
get_properties()
static make_association_id(definedby, subject, predicate, object, attributes=None)

A method to create unique identifiers for OBAN-style associations, based on all the parts of the association. If any of the items is empty or None, it will convert it to blank. It effectively digests the string of concatenated values. Subclasses of Assoc can submit an additional array of attributes that will be appended to the ID.

Parameters:
  • definedby – The (data) resource that provided the annotation
  • subject
  • predicate
  • object
  • attributes
Returns:

object_properties = {'causes_or_contributes': 'RO:0003302', 'expressed_in': 'RO:0002206', 'has disposition': 'RO:0000091', 'has_evidence': 'RO:0002558', 'has_object': 'OBAN:association_has_object', 'has_phenotype': 'RO:0002200', 'has_predicate': 'OBAN:association_has_predicate', 'has_provenance': 'OBAN:has_provenance', 'has_quality': 'RO:0000086', 'has_source': 'dc:source', 'has_subject': 'OBAN:association_has_subject', 'in_taxon': 'RO:0002162', 'is_about': 'IAO:0000136', 'towards': 'RO:0002503'}
properties = {'causes_or_contributes': 'RO:0003302', 'consider': 'OIO:consider', 'created_on': 'pav:createdOn', 'definition': 'IAO:0000115', 'expressed_in': 'RO:0002206', 'has disposition': 'RO:0000091', 'hasExactSynonym': 'OIO:hasExactSynonym', 'hasRelatedSynonym': 'OIO:hasRelatedSynonym', 'has_evidence': 'RO:0002558', 'has_measurement': 'IAO:0000004', 'has_object': 'OBAN:association_has_object', 'has_phenotype': 'RO:0002200', 'has_predicate': 'OBAN:association_has_predicate', 'has_provenance': 'OBAN:has_provenance', 'has_quality': 'RO:0000086', 'has_quantifier': 'GENO:0000866', 'has_source': 'dc:source', 'has_subject': 'OBAN:association_has_subject', 'has_xref': 'OIO:hasDbXref', 'in_taxon': 'RO:0002162', 'inchi_key': 'CHEBI:InChIKey', 'is_about': 'IAO:0000136', 'position': 'faldo:position', 'probabalistic_quantifier': 'GENO:0000867', 'replaced_by': 'IAO:0100001', 'towards': 'RO:0002503'}
set_association_id(assoc_id=None)

This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.

Parameters:assoc_id
Returns:
set_description(description)
set_object(identifier)
set_relationship(identifier)
set_score(score, unit=None, score_type=None)
set_subject(identifier)
dipper.models.assoc.Chem2DiseaseAssoc module
class dipper.models.assoc.Chem2DiseaseAssoc.Chem2DiseaseAssoc(graph, definedby, chem_id, phenotype_id, rel_id=None)

Bases: dipper.models.assoc.Association.Assoc

Attributes: assoc_id (str): Association Curie (Prefix:ID) chem_id (str): Chemical Curie phenotype_id (str): Phenotype Curie pub_list (str,list): One or more publication curies rel (str): Property relating assoc_id and chem_id evidence (str): Evidence curie

make_c2p_assoc_id()
set_association_id(assoc_id=None)

This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.

Parameters:assoc_id
Returns:
dipper.models.assoc.D2PAssoc module
class dipper.models.assoc.D2PAssoc.D2PAssoc(graph, definedby, disease_id, phenotype_id, onset=None, frequency=None, rel=None)

Bases: dipper.models.assoc.Association.Assoc

A specific association class for defining Disease-to-Phenotype relationships This assumes that a graph is created outside of this class, and nodes get added. By default, an association will assume the “has_phenotype” relationship, unless otherwise specified.

add_association_to_graph()

The reified relationship between a disease and a phenotype is decorated with some provenance information. This makes the assumption that both the disease and phenotype are classes.

Parameters:g
Returns:
d2p_object_properties = {'frequency': ':frequencyOfPhenotype', 'onset': ':onset'}
make_d2p_id()

Make an association id for phenotypic associations with disease that is defined by: source of association + disease + relationship + phenotype + onset + frequency

Returns:
set_association_id(assoc_id=None)

This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.

Parameters:assoc_id
Returns:
dipper.models.assoc.DispositionAssoc module
class dipper.models.assoc.DispositionAssoc.DispositionAssoc(graph, definedby, entity_id, heritability_id)

Bases: dipper.models.assoc.Association.Assoc

A specific Association model for Heritability annotations. These are to be used between diseases and a heritability disposition.

dipper.models.assoc.G2PAssoc module
class dipper.models.assoc.G2PAssoc.G2PAssoc(graph, definedby, entity_id, phenotype_id, rel=None)

Bases: dipper.models.assoc.Association.Assoc

A specific association class for defining Genotype-to-Phenotype relationships. This assumes that a graph is created outside of this class, and nodes get added. By default, an association will assume the “has_phenotype” relationship, unless otherwise specified. Note that genotypes are expected to be created and defined outside of this association, most likely by calling methods in the Genotype() class.

add_association_to_graph()

Overrides Association by including bnode support

The reified relationship between a genotype (or any genotype part) and a phenotype is decorated with some provenance information. This makes the assumption that both the genotype and phenotype are classes.

currently hardcoded to map the annotation to the monarch namespace :param g: :return:

g2p_types = {'developmental_process': 'GO:0032502'}
make_g2p_id()

Make an association id for phenotypic associations that is defined by: source of association + (Annot subject) + relationship + phenotype/disease + environment + start stage + end stage

Returns:
set_association_id(assoc_id=None)

This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.

Parameters:assoc_id
Returns:
set_environment(environment_id)
set_stage(start_stage_id, end_stage_id)
dipper.models.assoc.InteractionAssoc module
class dipper.models.assoc.InteractionAssoc.InteractionAssoc(graph, definedby, subj, obj, rel=None)

Bases: dipper.models.assoc.Association.Assoc

interaction_object_properties = {'colocalizes_with': 'RO:0002325', 'genetically_interacts_with': 'RO:0002435', 'interacts_with': 'RO:0002434', 'molecularly_interacts_with': 'RO:0002436', 'negatively_regulates': 'RO:0003002', 'positively_regulates': 'RO:0003003', 'regulates': 'RO:0002448', 'ubiquitinates': 'RO:0002480'}
dipper.models.assoc.OrthologyAssoc module
class dipper.models.assoc.OrthologyAssoc.OrthologyAssoc(graph, definedby, gene1, gene2, rel=None)

Bases: dipper.models.assoc.Association.Assoc

add_gene_family_to_graph(family_id)

Make an association between a group of genes and some grouping class. We make the assumption that the genes in the association are part of the supplied family_id, and that the genes have already been declared as classes elsewhere. The family_id is added as an individual of type DATA:gene_family.

Triples: <family_id> a DATA:gene_family <family_id> RO:has_member <gene1> <family_id> RO:has_member <gene2>

Parameters:
  • family_id
  • g – the graph to modify
Returns:

ortho_rel = {'has_member': 'RO:0002351', 'homologous': 'RO:HOM0000019', 'in_paralogous': 'RO:HOM0000023', 'least_diverged_orthologous': 'RO:HOM0000020', 'ohnologous': 'RO:HOM0000022', 'orthologous': 'RO:HOM0000017', 'paralogous': 'RO:HOM0000011', 'xenologous': 'RO:HOM0000018'}
terms = {'gene_family': 'DATA:3148'}
Submodules
dipper.models.Dataset module
class dipper.models.Dataset.Dataset(identifier, title, url, description=None, license_url=None, data_rights=None, graph_type=None, file_handle=None)

Bases: object

this will produce the metadata about a dataset following the example laid out here: http://htmlpreview.github.io/? https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html#appendix_1 (mind the wrap)

getGraph()
get_license()
setFileAccessUrl(url, is_object_literal=False)
setVersion(date_issued, version_id=None)
Legacy function…
should use the other set_* for version and date

as of 2016-10-20 used in:

dipper/sources/HPOAnnotations.py 139: dipper/sources/CTD.py 99: dipper/sources/BioGrid.py 100: dipper/sources/MGI.py 255: dipper/sources/EOM.py 93: dipper/sources/Coriell.py 200: dipper/sources/MMRRC.py 77:

# TODO set as deprecated

Parameters:
  • date_issued
  • version_id
Returns:

set_citation(citation_id)
set_date_issued(date_issued)
set_license(license)
set_version_by_date(date_issued=None)

This will set the version by the date supplied, the date already stored in the dataset description, or by the download date (today) :param date_issued: :return:

set_version_by_num(version_num)
dipper.models.Environment module
class dipper.models.Environment.Environment(graph)

Bases: object

These are convenience methods for adding items related to an experimental environment and its parts to a supplied graph.

This is a stub ready for expansion.

addComponentAttributes(component_id, entity_id, value=None, unit=None)
addComponentToEnvironment(env_id, component_id)
addEnvironment(env_id, env_label, env_type=None, env_description=None)
addEnvironmentalCondition(cond_id, cond_label, cond_type=None, cond_description=None)
annotation_properties = {}
environment_parts = {'crispr_reagent': 'REO:crispr_TBD', 'environmental_condition': 'XCO:0000000', 'environmental_system': 'ENVO:01000254', 'morpholio_reagent': 'REO:0000042', 'talen_reagent': 'REO:0001022'}
object_properties = {'has_part': 'BFO:0000051'}
properties = {'has_part': 'BFO:0000051'}
dipper.models.Evidence module
class dipper.models.Evidence.Evidence(graph, association)

Bases: object

To model evidence as the basis for an association. This encompasses:

  • measurements taken from the lab, and their significance.
    these can be derived from papers or other agents.
  • papers
>1 measurement may result from an assay,
each of which may have its own significance
add_data_individual(data_curie, label=None, ind_type=None)

Add data individual :param data_curie: str either curie formatted or long string,

long strings will be converted to bnodes
Parameters:
  • type – str curie
  • label – str
Returns:

None

add_evidence(evidence_line, ev_type=None, label=None)

Add line of evidence node to association id

Parameters:
  • assoc_id – curie or iri, association id
  • evidence_line – curie or iri, evidence line
Returns:

None

add_source(evidence_line, source, label=None, src_type=None)

Applies the triples: <evidence> <dc:source> <source> <source> <rdf:type> <type> <source> <rdfs:label> “label”

TODO this should belong in a higher level class :param evidence_line: str curie :param source: str source as curie :param label: optional, str type as curie :param type: optional, str type as curie :return: None

add_supporting_data(evidence_line, measurement_dict)

Add supporting data :param evidence_line: :param data_object: dict, where keys are curies or iris and values are measurement values for example:

{
    "_:1234": "1.53E07",
    "_:4567": "20.25"
}

Note: assumes measurements are RDF:Type ‘ed elsewhere :return: None

add_supporting_evidence(evidence_line, type=None, label=None)

Add supporting line of evidence node to association id

Parameters:
  • assoc_id – curie or iri, association id
  • evidence_line – curie or iri, evidence line
Returns:

None

add_supporting_publication(evidence_line, publication, label=None, pub_type=None)

<evidence> <SEPIO:0000124> <source> <source> <rdf:type> <type> <source> <rdfs:label> “label” :param evidence_line: str curie :param publication: str curie :param label: optional, str type as curie :param type: optional, str type as curie :return:

data_property = {'has_measurement': 'IAO:0000004', 'has_value': 'STATO:0000129'}
data_types = {'count': 'SIO:000794', 'odds_ratio': 'STATO:0000182', 'proportional_reporting_ratio': 'OAE:0001563'}
evidence_types = {'assay': 'OBI:0000070', 'blood test evidence': 'ECO:0001016', 'effect_size': 'STATO:0000085', 'fold_change': 'STATO:0000169', 'measurement datum': 'IAO:0000109', 'percent_change': 'STATO:percent_change', 'pvalue': 'OBI:0000175', 'statistical_hypothesis_test': 'OBI:0000673', 'zscore': 'STATO:0000104'}
object_properties = {'has_evidence': 'SEPIO:0000006', 'has_significance': 'STATO:has_significance', 'has_supporting_data': 'SEPIO:0000084', 'has_supporting_evidence': 'SEPIO:0000007', 'has_supporting_reference': 'SEPIO:0000124', 'is_evidence_for': 'SEPIO:0000031', 'is_evidence_supported_by': 'SEPIO:000010', 'is_evidence_with_support_from': 'SEPIO:0000059', 'is_refuting_evidence_for': 'SEPIO:0000033', 'is_supporting_evidence_for': 'SEPIO:0000032', 'source': 'dc:source'}
dipper.models.Family module
class dipper.models.Family.Family(graph)

Bases: object

Model mereological/part whole relationships

Although these relations are more abstract, we often use them to model family relationships (proteins, humans, etc.) The naming of this class may change in the future to better reflect the meaning of the relations it is modeling

addMember(group_id, member_id)
addMemberOf(member_id, group_id)
object_properties = {'has_member': 'RO:0002351', 'member_of': 'RO:0002350'}
dipper.models.GenomicFeature module
class dipper.models.GenomicFeature.Feature(graph, feature_id=None, label=None, feature_type=None, description=None)

Bases: object

Dealing with genomic features here. By default they are all faldo:Regions. We use SO for typing genomic features. At the moment, RO:has_subsequence is the default relationship between the regions, but this should be tested/verified.

TODO: the graph additions are in the addXToFeature functions, but should be separated. TODO: this will need to be extended to properly deal with fuzzy positions in faldo.

addFeatureEndLocation(coordinate, reference_id, strand=None, position_types=None)

Adds the coordinate details for the end of this feature :param coordinate: :param reference_id: :param strand:

Returns:
addFeatureProperty(property_type, property)
addFeatureStartLocation(coordinate, reference_id, strand=None, position_types=None)

Adds coordinate details for the start of this feature. :param coordinate: :param reference_id: :param strand: :param position_types:

Returns:
addFeatureToGraph(add_region=True, region_id=None, feature_as_class=False)

We make the assumption here that all features are instances. The features are located on a region, which begins and ends with faldo:Position The feature locations leverage the Faldo model, which has a general structure like: Triples: feature_id a feature_type (individual) faldo:location region_id region_id a faldo:region faldo:begin start_position faldo:end end_position start_position a (any of: faldo:(((Both|Plus|Minus)Strand)|Exact)Position) faldo:position Integer(numeric position) faldo:reference reference_id end_position a (any of: faldo:(((Both|Plus|Minus)Strand)|Exact)Position) faldo:position Integer(numeric position) faldo:reference reference_id

Parameters:graph
Returns:
addPositionToGraph(reference_id, position, position_types=None, strand=None)

Add the positional information to the graph, following the faldo model. We assume that if the strand is None, we give it a generic “Position” only. Triples: my_position a (any of: faldo:(((Both|Plus|Minus)Strand)|Exact)Position) faldo:position Integer(numeric position) faldo:reference reference_id

Parameters:
  • graph
  • reference_id
  • position
  • position_types
  • strand
Returns:

Identifier of the position created

addRegionPositionToGraph(region_id, begin_position_id, end_position_id)
addSubsequenceOfFeature(parentid)

This will add reciprocal triples like: feature is_subsequence_of parent parent has_subsequence feature :param graph: :param parentid:

Returns:
addTaxonToFeature(taxonid)

Given the taxon id, this will add the following triple: feature in_taxon taxonid :param graph: :param taxonid: :return:

annotation_properties = {}
data_properties = {'position': 'faldo:position'}
object_properties = {'begin': 'faldo:begin', 'downstream_of_sequence_of': 'RO:0002529', 'end': 'faldo:end', 'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'has_staining_intensity': 'GENO:0000207', 'has_subsequence': 'RO:0002524', 'is_about': 'IAO:0000136', 'is_subsequence_of': 'RO:0002525', 'location': 'faldo:location', 'reference': 'faldo:reference', 'upstream_of_sequence_of': 'RO:0002528'}
properties = {'begin': 'faldo:begin', 'downstream_of_sequence_of': 'RO:0002529', 'end': 'faldo:end', 'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'has_staining_intensity': 'GENO:0000207', 'has_subsequence': 'RO:0002524', 'is_about': 'IAO:0000136', 'is_subsequence_of': 'RO:0002525', 'location': 'faldo:location', 'position': 'faldo:position', 'reference': 'faldo:reference', 'upstream_of_sequence_of': 'RO:0002528'}
types = {'FuzzyPosition': 'faldo:FuzzyPosition', 'Position': 'faldo:Position', 'SNP': 'SO:0000694', 'assembly_component': 'SO:0000143', 'band_intensity': 'GENO:0000618', 'both_strand': 'faldo:BothStrandPosition', 'centromere': 'SO:0000577', 'chromosome': 'SO:0000340', 'chromosome_arm': 'SO:0000105', 'chromosome_band': 'SO:0000341', 'chromosome_part': 'SO:0000830', 'chromosome_region': 'GENO:0000614', 'chromosome_subband': 'GENO:0000616', 'genome': 'SO:0001026', 'gneg': 'GENO:0000620', 'gpos': 'GENO:0000619', 'gpos100': 'GENO:0000622', 'gpos25': 'GENO:0000625', 'gpos33': 'GENO:0000633', 'gpos50': 'GENO:0000624', 'gpos66': 'GENO:0000632', 'gpos75': 'GENO:0000623', 'gvar': 'GENO:0000621', 'haplotype': 'GENO:0000871', 'long_chromosome_arm': 'GENO:0000629', 'minus_strand': 'faldo:MinusStrandPosition', 'plus_strand': 'faldo:PlusStrandPosition', 'reference_genome': 'SO:0001505', 'region': 'faldo:Region', 'score': 'SO:0001685', 'short_chromosome_arm': 'GENO:0000628'}
dipper.models.GenomicFeature.makeChromID(chrom, reference=None, prefix=None)

This will take a chromosome number and a NCBI taxon number, and create a unique identifier for the chromosome. These identifiers are made in the @base space like: Homo sapiens (9606) chr1 ==> :9606chr1 Mus musculus (10090) chrX ==> :10090chrX

Parameters:
  • chrom – the chromosome (preferably without any chr prefix)
  • reference – the numeric portion of the taxon id
Returns:

dipper.models.GenomicFeature.makeChromLabel(chrom, reference=None)
dipper.models.Genotype module
class dipper.models.Genotype.Genotype(graph)

Bases: object

These are convenience methods for adding items related to a genotype and its parts to a supplied graph. They follow the patterns set out in GENO https://github.com/monarch-initiative/GENO-ontology. For specific sequence features, we use the GenomicFeature class to create them.

addAffectedLocus(allele_id, gene_id, rel_id=None)

We make the assumption here that if the relationship is not provided, it is a GENO:is_sequence_variant_instance_of.

Here, the allele should be a variant_locus, not a sequence alteration. :param allele_id: :param gene_id: :param rel_id: :return:

addAllele(allele_id, allele_label, allele_type=None, allele_description=None)

Make an allele object. If no allele_type is added, it will default to a geno:allele :param allele_id: curie for allele (required) :param allele_label: label for allele (required) :param allele_type: id for an allele type (optional, recommended SO or GENO class) :param allele_description: a free-text description of the allele :return:

addAlleleOfGene(allele_id, gene_id, rel_id=None)

We make the assumption here that if the relationship is not provided, it is a GENO:is_sequence_variant_instance_of.

Here, the allele should be a variant_locus, not a sequence alteration. :param allele_id: :param gene_id: :param rel_id: :return:

addChromosome(chr, tax_id, tax_label=None, build_id=None, build_label=None)

if it’s just the chromosome, add it as an instance of a SO:chromosome, and add it to the genome. If a build is included, pun the chromosome as a subclass of SO:chromosome, and make the build-specific chromosome an instance of the supplied chr. The chr then becomes part of the build or genome.

addChromosomeClass(chrom_num, taxon_id, taxon_label)
addChromosomeInstance(chr_num, reference_id, reference_label, chr_type=None)

Add the supplied chromosome as an instance within the given reference :param chr_num: :param reference_id: for example, a build id like UCSC:hg19 :param reference_label: :param chr_type: this is the class that this is an instance of. typically a genome-specific chr

Returns:
addConstruct(construct_id, construct_label, construct_type=None, construct_description=None)
addDerivesFrom(child_id, parent_id)

We add a derives_from relationship between the child and parent id. Examples of uses include: between an allele and a construct or strain; between a cell line and its parent genotype. Adding the parent and child to the graph should happen outside of this function call to ensure graph integrity. :param child_id: :param parent_id: :return:

addGene(gene_id, gene_label, gene_type=None, gene_description=None)
addGeneProduct(sequence_id, product_id, product_label=None, product_type=None)

Add gene/variant/allele has_gene_product relationship Can be used to either describe a gene to transcript relationship or gene to protein :param sequence_id: :param product_id: :param product_label: :param product_type: :return:

addGeneTargetingReagent(reagent_id, reagent_label, reagent_type, gene_id, description=None)

Here, a gene-targeting reagent is added. The actual targets of this reagent should be added separately. :param reagent_id: :param reagent_label: :param reagent_type:

Returns:
addGeneTargetingReagentToGenotype(reagent_id, genotype_id)
addGenome(taxon_id, taxon_label=None)
addGenomicBackground(background_id, background_label, background_type=None, background_description=None)
addGenomicBackgroundToGenotype(background_id, genotype_id, background_type=None)
addGenotype(genotype_id, genotype_label, genotype_type=None, genotype_description=None)

If a genotype_type is not supplied, we will default to ‘intrinsic_genotype’ :param genotype_id: :param genotype_label: :param genotype_type: :param genotype_description: :return:

addMemberOfPopulation(member_id, population_id)
addParts(part_id, parent_id, part_relationship=None)

This will add a has_part (or subproperty) relationship between a parent_id and the supplied part. By default the relationship will be BFO:has_part, but any relationship could be given here. :param part_id: :param parent_id: :param part_relationship: :return:

addPartsToVSLC(vslc_id, allele1_id, allele2_id, zygosity_id=None, allele1_rel=None, allele2_rel=None)

Here we add the parts to the VSLC. While traditionally alleles (reference or variant loci) are added, you can add any node (such as sequence_alterations for unlocated variations) to a vslc if they are known to be paired. However, if a sequence_alteration’s locus is unknown, it probably should be added directly to the GVC. :param vslc_id: :param allele1_id: :param allele2_id: :param zygosity_id: :param allele1_rel: :param allele2_rel: :return:

addPolypeptide(polypeptide_id, polypeptide_label=None, transcript_id=None, polypeptide_type=None)
Parameters:
  • polypeptide_id
  • polypeptide_label
  • polypeptide_type
  • transcript_id
Returns:

addReagentTargetedGene(reagent_id, gene_id, targeted_gene_id=None, targeted_gene_label=None, description=None)

This will create the instance of a gene that is targeted by a molecular reagent (such as a morpholino or rnai). If an instance id is not supplied, we will create it as an anonymous individual which is of the type GENO:reagent_targeted_gene. We will also add the targets relationship between the reagent and gene class.

<targeted_gene_id> a GENO:reagent_targeted_gene rdf:label targeted_gene_label dc:description description <reagent_id> GENO:targets_instance_of <gene_id>

Parameters:
  • reagent_id
  • gene_id
  • targeted_gene_id
Returns:

addReferenceGenome(build_id, build_label, taxon_id)
addSequenceAlteration(sa_id, sa_label, sa_type=None, sa_description=None)
addSequenceAlterationToVariantLocus(sa_id, vl_id)
addSequenceDerivesFrom(child_id, parent_id)
addTargetedGeneComplement(tgc_id, tgc_label, tgc_type=None, tgc_description=None)
addTargetedGeneSubregion(tgs_id, tgs_label, tgs_type=None, tgs_description=None)
addTaxon(taxon_id, genopart_id)

The supplied geno part will have the specified taxon added with RO:in_taxon relation. Generally the taxon is associated with a genomic_background, but could be added to any genotype part (including a gene, regulatory element, or sequence alteration). :param taxon_id: :param genopart_id:

Returns:
addVSLCtoParent(vslc_id, parent_id)

The VSLC can either be added to a genotype or to a GVC. The vslc is added as a part of the parent. :param vslc_id: :param parent_id: :return:

annotation_properties = {'altered_nucleotide': 'GENO:altered_nucleotide', 'reference_amino_acid': 'GENO:reference_amino_acid', 'reference_nucleotide': 'GENO:reference_nucleotide', 'results_in_amino_acid_change': 'GENO:results_in_amino_acid_change'}
genoparts = {'QTL': 'SO:0000771', 'RNAi_reagent': 'SO:0000337', 'allele': 'GENO:0000512', 'biological_region': 'SO:0001411', 'cDNA': 'SO:0000756', 'coding_transgene_feature': 'GENO:0000638', 'cytogenetic marker': 'SO:0000341', 'deletion': 'SO:0000159', 'duplication': 'SO:1000035', 'effective_genotype': 'GENO:0000525', 'extrinsic_genotype': 'GENO:0000524', 'family': 'PCO:0000020', 'female_genotype': 'GENO:0000647', 'gene': 'SO:0000704', 'genomic_background': 'GENO:0000611', 'genomic_variation_complement': 'GENO:0000009', 'heritable_phenotypic_marker': 'SO:0001500', 'insertion': 'SO:0000667', 'intrinsic_genotype': 'GENO:0000000', 'inversion': 'SO:1000036', 'karyotype_variation_complement': 'GENO:0000644', 'male_genotype': 'GENO:0000646', 'missense_variant': 'SO:0001583', 'ncRNA_gene': 'SO:0001263', 'point_mutation': 'SO:1000008', 'polypeptide': 'SO:0000104', 'population': 'PCO:0000001', 'protein_coding_gene': 'SO:0001217', 'pseudogene': 'SO:0000336', 'reagent_targeted_gene': 'GENO:0000504', 'reference_locus': 'GENO:0000036', 'regulatory_transgene_feature': 'GENO:0000637', 'sequence_alteration': 'SO:0001059', 'sequence_feature': 'SO:0000110', 'sequence_variant_affecting_polypeptide_function': 'SO:1000117', 'sequence_variant_causing_gain_of_function_of_polypeptide': 'SO:1000125', 'sequence_variant_causing_inactive_catalytic_site': 'SO:1000120', 'sequence_variant_causing_loss_of_function_of_polypeptide': 'SO:1000118', 'sex_qualified_genotype': 'GENO:0000645', 'substitution': 'SO:1000002', 'tandem_duplication': 'SO:1000173', 'targeted_gene_complement': 'GENO:0000527', 'targeted_gene_subregion': 'GENO:0000534', 'transcript': 'SO:0000233', 'transgene': 'SO:0000902', 'transgenic_insertion': 'SO:0001218', 'translocation': 'SO:0000199', 'unspecified_genomic_background': 'GENO:0000649', 'variant_locus': 'GENO:0000002', 'variant_single_locus_complement': 'GENO:0000030', 'wildtype': 'GENO:0000511'}
makeGenomeID(taxon_id)
make_experimental_model_with_genotype(genotype_id, genotype_label, taxon_id, taxon_label)
make_variant_locus_label(gene_label, allele_label)
make_vslc_label(gene_label, allele1_label, allele2_label)

Make a Variant Single Locus Complement (VSLC) in monarch-style. :param gene_label: :param allele1_label: :param allele2_label: :return:

object_properties = {'derives_from': 'RO:0001000', 'derives_sequence_from_gene': 'GENO:0000639', 'has_affected_locus': 'GENO:0000418', 'has_alternate_part': 'GENO:0000382', 'has_gene_product': 'RO:0002205', 'has_genotype': 'GENO:0000222', 'has_member_with_allelotype': 'GENO:0000225', 'has_part': 'BFO:0000051', 'has_phenotype': 'RO:0002200', 'has_reference_part': 'GENO:0000385', 'has_sex_agnostic_genotype_part': 'GENO:0000650', 'has_variant_part': 'GENO:0000382', 'has_zygosity': 'GENO:0000608', 'in_taxon': 'RO:0002162', 'is_allelotype_of': 'GENO:0000206', 'is_mutant_of': 'GENO:0000440', 'is_reference_instance_of': 'GENO:0000610', 'is_sequence_variant_instance_of': 'GENO:0000408', 'is_targeted_expression_variant_of': 'GENO:0000443', 'is_transgene_variant_of': 'GENO:0000444', 'targeted_by': 'GENO:0000634', 'targets_instance_of': 'GENO:0000414', 'translates_to': 'RO:0002513'}
properties = {'altered_nucleotide': 'GENO:altered_nucleotide', 'derives_from': 'RO:0001000', 'derives_sequence_from_gene': 'GENO:0000639', 'has_affected_locus': 'GENO:0000418', 'has_alternate_part': 'GENO:0000382', 'has_gene_product': 'RO:0002205', 'has_genotype': 'GENO:0000222', 'has_member_with_allelotype': 'GENO:0000225', 'has_part': 'BFO:0000051', 'has_phenotype': 'RO:0002200', 'has_reference_part': 'GENO:0000385', 'has_sex_agnostic_genotype_part': 'GENO:0000650', 'has_variant_part': 'GENO:0000382', 'has_zygosity': 'GENO:0000608', 'in_taxon': 'RO:0002162', 'is_allelotype_of': 'GENO:0000206', 'is_mutant_of': 'GENO:0000440', 'is_reference_instance_of': 'GENO:0000610', 'is_sequence_variant_instance_of': 'GENO:0000408', 'is_targeted_expression_variant_of': 'GENO:0000443', 'is_transgene_variant_of': 'GENO:0000444', 'reference_amino_acid': 'GENO:reference_amino_acid', 'reference_nucleotide': 'GENO:reference_nucleotide', 'results_in_amino_acid_change': 'GENO:results_in_amino_acid_change', 'targeted_by': 'GENO:0000634', 'targets_instance_of': 'GENO:0000414', 'translates_to': 'RO:0002513'}
zygosity = {'complex_heterozygous': 'GENO:0000402', 'hemizygous': 'GENO:0000606', 'hemizygous-x': 'GENO:0000605', 'hemizygous-y': 'GENO:0000604', 'heteroplasmic': 'GENO:0000603', 'heterozygous': 'GENO:0000135', 'homoplasmic': 'GENO:0000602', 'homozygous': 'GENO:0000136', 'indeterminate': 'GENO:0000137', 'simple_heterozygous': 'GENO:0000458'}
dipper.models.Model module
class dipper.models.Model.Model(graph)

Bases: object

Utility class to add common triples to a graph (subClassOf, type, label, sameAs)

addBlankNodeAnnotation(node_id)

Add an annotation property to the given `node_id` to be a pseudo blank node. This is a monarchism. :param node_id: :return:

addClassToGraph(class_id, label=None, class_type=None, description=None)

Any node added to the graph will get at least 3 triples: (node, rdf:type, owl:Class) and (node, rdfs:label, literal(label)); if a type is added, the node will also be made an rdfs:subClassOf of that type; if a description is provided, it will also get added as a dc:description.
Parameters:
  • class_id
  • label
  • class_type
  • description
Returns:

addComment(subject_id, comment)
addDefinition(class_id, definition)
addDepiction(subject_id, image_url)
addDeprecatedClass(old_id, new_ids=None)

Will mark the oldid as a deprecated class. if one newid is supplied, it will mark it as replaced by. if >1 newid is supplied, it will mark it with consider properties :param old_id: str - the class id to deprecate :param new_ids: list - the class list that is

the replacement(s) of the old class. Not required.
Returns:None
addDeprecatedIndividual(old_id, new_ids=None)

Will mark the oldid as a deprecated individual. if one newid is supplied, it will mark it as replaced by. if >1 newid is supplied, it will mark it with consider properties :param g: :param oldid: the individual id to deprecate :param newids: the individual idlist that is the replacement(s) of

the old individual. Not required.
Returns:
addDescription(subject_id, description)
addEquivalentClass(sub, obj)
addIndividualToGraph(ind_id, label, ind_type=None, description=None)
addLabel(subject_id, label)
addOWLPropertyClassRestriction(class_id, property_id, property_value)
addOWLVersionIRI(ontology_id, version_iri)
addOWLVersionInfo(ontology_id, version_info)
addOntologyDeclaration(ontology_id)
addPerson(person_id, person_label=None)
addSameIndividual(sub, obj)
addSubClass(child_id, parent_id)
addSynonym(class_id, synonym, synonym_type=None)

Add the synonym as a property of the class cid. Assume it is an exact synonym, unless otherwise specified :param g: :param cid: class id :param synonym: the literal synonym label :param synonym_type: the CURIE of the synonym type (not the URI) :return:

addTriple(subject_id, predicate_id, obj, object_is_literal=False, literal_type=None)
addType(subject_id, subject_type)
addXref(class_id, xref_id, xref_as_literal=False)
annotation_properties = {'clique_leader': 'MONARCH:cliqueLeader', 'comment': 'dc:comment', 'consider': 'OIO:consider', 'definition': 'IAO:0000115', 'depiction': 'foaf:depiction', 'description': 'dc:description', 'hasExactSynonym': 'OIO:hasExactSynonym', 'hasRelatedSynonym': 'OIO:hasRelatedSynonym', 'has_xref': 'OIO:hasDbXref', 'inchi_key': 'CHEBI:InChIKey', 'is_anonymous': 'MONARCH:anonymous', 'label': 'rdfs:label', 'replaced_by': 'IAO:0100001', 'version_info': 'owl:versionInfo'}
datatype_properties = {'has_measurement': 'IAO:0000004', 'position': 'faldo:position'}
makeLeader(node_id)

Add an annotation property to the given `node_id` to be the clique_leader. This is a monarchism. :param node_id: :return:

object_properties = {'causally_influences': 'RO:0002566', 'causally_upstream_of_or_within': 'RO:0002418', 'contributes_to': 'RO:0002326', 'correlates_with': 'RO:0002610', 'dc:evidence': 'dc:evidence', 'dc:source': 'dc:source', 'derives_from': 'RO:0001000', 'enables': 'RO:0002327', 'ends_during': 'RO:0002093', 'ends_with': 'RO:0002230', 'equivalent_class': 'owl:equivalentClass', 'existence_ends_at': 'UBERON:existence_ends_at', 'existence_ends_during': 'RO:0002492', 'existence_starts_at': 'UBERON:existence_starts_at', 'existence_starts_during': 'RO:0002488', 'has disposition': 'RO:0000091', 'has_author': 'ERO:0000232', 'has_begin_stage_qualifier': 'GENO:0000630', 'has_end_stage_qualifier': 'GENO:0000631', 'has_environment_qualifier': 'GENO:0000580', 'has_evidence': 'RO:0002558', 'has_gene_product': 'RO:0002205', 'has_object': ':hasObject', 'has_origin': 'GENO:0000643', 'has_part': 'BFO:0000051', 'has_phenotype': 'RO:0002200', 'has_predicate': ':hasPredicate', 'has_qualifier': 'GENO:0000580', 'has_quality': 'RO:0000086', 'has_sex_specificity': ':has_sex_specificity', 'has_subject': ':hasSubject', 'in_taxon': 'RO:0002162', 'involved_in': 'RO:0002331', 'is_about': 'IAO:0000136', 'is_marker_for': 'RO:0002607', 'mentions': 'IAO:0000142', 'model_of': 'RO:0003301', 'occurs_in': 'BFO:0000066', 'on_property': 'owl:onProperty', 'part_of': 'BFO:0000050', 'same_as': 'owl:sameAs', 'some_values_from': 'owl:someValuesFrom', 'starts_during': 'RO:0002091', 'starts_with': 'RO:0002224', 'subclass_of': 'rdfs:subClassOf', 'substance_that_treats': 'RO:0002606', 'towards': 'RO:0002503', 'type': 'rdf:type', 'version_iri': 'owl:versionIRI'}
types = {'annotation_property': 'owl:AnnotationProperty', 'class': 'owl:Class', 'datatype_property': 'owl:DatatypeProperty', 'deprecated': 'owl:deprecated', 'named_individual': 'owl:NamedIndividual', 'object_property': 'owl:ObjectProperty', 'ontology': 'owl:Ontology', 'person': 'foaf:Person', 'restriction': 'owl:Restriction'}
dipper.models.Pathway module
class dipper.models.Pathway.Pathway(graph)

Bases: object

This provides convenience methods to deal with gene and protein collections in the context of pathways.

addComponentToPathway(component_id, pathway_id)

This can be used directly when the component is directly involved in the pathway. If a transforming event is performed on the component first, then the addGeneToPathway should be used instead.

Parameters:
  • pathway_id
  • component_id
Returns:

addGeneToPathway(gene_id, pathway_id)

When adding a gene to a pathway, we create an intermediate ‘gene product’ that is involved in the pathway, through a blank node.

gene_id RO:has_gene_product _gene_product _gene_product RO:involved_in pathway_id

Parameters:
  • pathway_id
  • gene_id
Returns:

addPathway(pathway_id, pathway_label, pathway_type=None, pathway_description=None)

Adds a pathway as a class. If no specific type is specified, it will default to a subclass of “GO:cellular_process” and “PW:pathway”. :param pathway_id: :param pathway_label: :param pathway_type: :param pathway_description: :return:

object_properties = {'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'involved_in': 'RO:0002331'}
pathway_parts = {'cellular_process': 'GO:0009987', 'gene_product': 'CHEBI:33695', 'pathway': 'PW:0000001', 'signal_transduction': 'GO:0007165'}
properties = {'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'involved_in': 'RO:0002331'}
dipper.models.Provenance module
class dipper.models.Provenance.Provenance(graph)

Bases: object

To model provenance as the basis for an association. This encompasses:

  • Process history leading to a claim being made, including processes through which evidence is evaluated
  • Processes through which information used as evidence is created.
Provenance metadata includes accounts of who conducted these processes,
what entities participated in them, and when/where they occurred.
add_agent_to_graph(agent_id, agent_label, agent_type=None, agent_description=None)
add_assay_to_graph(assay_id, assay_label, assay_type=None, assay_description=None)
add_assertion(assertion, agent, agent_label, date=None)

Add assertion to graph :param assertion: :param agent: :param evidence_line: :param date: :return: None

add_date_created(prov_type, date)
add_study_measure(study, measure)
add_study_parts(study, study_parts)
add_study_to_measurements(study, measurements)
object_properties = {'asserted_by': 'SEPIO:0000130', 'created_at_location': 'SEPIO:0000019', 'created_by': 'SEPIO:0000018', 'created_on': 'pav:createdOn', 'created_with_resource': 'SEPIO:0000022', 'date_created': 'SEPIO:0000021', 'has_agent': 'SEPIO:0000017', 'has_input': 'RO:0002233', 'has_participant': 'RO:0000057', 'has_provenance': 'SEPIO:0000011', 'has_supporting_study': 'SEPIO:0000085', 'is_asserted_in': 'SEPIO:0000015', 'is_assertion_supported_by': 'SEPIO:0000111', 'measures': 'SEPIO:0000114', 'output_of': 'RO:0002353', 'specified_by': 'SEPIO:0000041'}
provenance_types = {'assay': 'OBI:0000070', 'assertion': 'SEPIO:0000001', 'assertion_process': 'SEPIO:0000003', 'mixed_model': 'STATO:0000189', 'organization': 'foaf:organization', 'person': 'foaf:person', 'project': 'VIVO:Project', 'statistical_hypothesis_test': 'OBI:0000673', 'study': 'OBI:0000471', 'variant_classification_guideline': 'SEPIO:0000037', 'xref': 'OIO:hasdbxref'}
dipper.models.Reference module
class dipper.models.Reference.Reference(graph, ref_id=None, ref_type=None)

Bases: object

To model references for associations
(such as journal articles, books, etc.).
By default, references will be typed as “documents”,
unless if the type is set otherwise.
If a short_citation is set, this will be used for the individual’s label.
We may wish to subclass this later.
addAuthor(author)
addPage(subject_id, page_url)
addRefToGraph()
addTitle(subject_id, title)
annotation_properties = {'page': 'foaf:page', 'title': 'dc:title'}
ref_types = {'document': 'IAO:0000310', 'journal_article': 'IAO:0000013', 'person': 'foaf:Person', 'photograph': 'IAO:0000185', 'publication': 'IAO:0000311', 'webpage': 'SIO:000302'}
setAuthorList(author_list)
Parameters:author_list – Array of authors
Returns:
setShortCitation(citation)
setTitle(title)
setType(reference_type)
setYear(year)
dipper.sources package
Submodules
dipper.sources.AnimalQTLdb module
class dipper.sources.AnimalQTLdb.AnimalQTLdb(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The Animal Quantitative Trait Loci (QTL) database (Animal QTLdb) is designed to house all publicly available QTL and single-nucleotide polymorphism/gene association data on livestock animal species. This includes:

  • chicken
  • horse
  • cow
  • sheep
  • rainbow trout
  • pig

While most of the phenotypes here are related to animal husbandry, production, and rearing, integration of these phenotypes with other species may lead to insight for human disease.

Here, we use the QTL genetic maps and their computed genomic locations to create associations between the QTLs and their traits. The traits come in their internal Animal Trait ontology vocabulary, which they further map to [Vertebrate Trait](http://bioportal.bioontology.org/ontologies/VT), Product Trait, and Clinical Measurement Ontology vocabularies.

Since these are only associations to broad locations, we link the traits via “is_marker_for”, since there is no specific causative nature in the association. p-values for the associations are attached to the Association objects. We default to the UCSC build for the genomic coordinates, and make equivalences.

We do not include any genetic position ranges that are < 0.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'cattle_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_Btau_4.6.gff.txt.gz', 'file': 'QTL_Btau_4.6.gff.txt.gz'}, 'cattle_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/cattle_QTLdata.txt', 'file': 'cattle_QTLdata.txt'}, 'chicken_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_GG_4.0.gff.txt.gz', 'file': 'QTL_GG_4.0.gff.txt.gz'}, 'chicken_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/chicken_QTLdata.txt', 'file': 'chicken_QTLdata.txt'}, 'horse_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_EquCab2.0.gff.txt.gz', 'file': 'QTL_EquCab2.0.gff.txt.gz'}, 'horse_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/horse_QTLdata.txt', 'file': 'horse_QTLdata.txt'}, 'pig_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_SS_10.2.gff.txt.gz', 'file': 'QTL_SS_10.2.gff.txt.gz'}, 'pig_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/pig_QTLdata.txt', 'file': 'pig_QTLdata.txt'}, 'rainbow_trout_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/rainbow_trout_QTLdata.txt', 'file': 'rainbow_trout_QTLdata.txt'}, 'sheep_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_OAR_3.1.gff.txt.gz', 'file': 'QTL_OAR_3.1.gff.txt.gz'}, 'sheep_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/sheep_QTLdata.txt', 'file': 'sheep_QTLdata.txt'}, 'trait_mappings': {'url': 'http://www.animalgenome.org/QTLdb/export/trait_mappings.csv', 'file': 'trait_mappings'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)
Parameters:limit
Returns:
test_ids = {1795, 28483, 32133, 1798, 29385, 29018, 31023, 8945, 17138, 12532, 29016, 14234}
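
Source classes like AnimalQTLdb are typically driven from the command line, but the documented fetch()/parse() methods can also be called directly. A minimal sketch, assuming 'rdf_graph' is an accepted graph_type value (only the constructor signature is shown above):

from dipper.sources.AnimalQTLdb import AnimalQTLdb

# graph_type='rdf_graph' and skolemized blank nodes are assumed values
qtl = AnimalQTLdb('rdf_graph', True)
qtl.fetch(is_dl_forced=False)  # download the files listed in AnimalQTLdb.files
qtl.parse(limit=100)           # process only the first 100 rows of each file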
dipper.sources.Bgee module
class dipper.sources.Bgee.Bgee(graph_type, are_bnodes_skolemized, tax_ids=None, version=None)

Bases: dipper.sources.Source.Source

Bgee is a database to retrieve and compare gene expression patterns between animal species.

Bgee first maps heterogeneous expression data (currently RNA-Seq, Affymetrix, in situ hybridization, and EST data) to anatomy and development of different species.

Then, to enable automated cross-species comparisons, homology relationships across anatomies and comparison criteria between developmental stages are designed.

BGEE_FTP = 'ftp.bgee.org'
DEFAULT_TAXA = [10090, 10116, 13616, 28377, 6239, 7227, 7955, 8364, 9031, 9258, 9544, 9593, 9597, 9598, 9606, 9823, 9913]
checkIfRemoteIsNewer(localfile, remote_size, remote_modify)

Overrides checkIfRemoteIsNewer in Source class

Parameters:
  • localfile – str file path
  • remote_size – str bytes
  • remote_modify – str last modify date in the form 20160705042714
Returns:

boolean True if remote file is newer else False

fetch(is_dl_forced=False)
Parameters:is_dl_forced – boolean, force download
Returns:
files = {'anat_entity': {'path': '/download/ranks/anat_entity/', 'pattern': re.compile('.*_all_data_.*')}}
parse(limit=None)

Given the input taxa, expects files in the raw directory with the name {tax_id}_anat_entity_all_data_Pan_troglodytes.tsv.zip

Parameters:limit – int Limit to top ranked anatomy associations per group
Returns:None
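
Because Bgee accepts a tax_ids argument, processing can be restricted to a subset of DEFAULT_TAXA. A short sketch under the same assumptions as the AnimalQTLdb example above:

from dipper.sources.Bgee import Bgee

# Restrict to human (9606) and mouse (10090); values mirror DEFAULT_TAXA entries
bgee = Bgee('rdf_graph', True, tax_ids=[9606, 10090])
bgee.fetch()
bgee.parse(limit=10)  # top 10 ranked anatomy associations per group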
dipper.sources.BioGrid module
class dipper.sources.BioGrid.BioGrid(graph_type, are_bnodes_skolemized, tax_ids=None)

Bases: dipper.sources.Source.Source

Biogrid interaction data

biogrid_ids = [106638, 107308, 107506, 107674, 107675, 108277, 108506, 108767, 108814, 108899, 110308, 110364, 110678, 111642, 112300, 112365, 112771, 112898, 199832, 203220, 247276, 120150, 120160, 124085]
fetch(is_dl_forced=False)
Parameters:is_dl_forced
Returns:None
files = {'identifiers': {'url': 'http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-IDENTIFIERS-LATEST.tab.zip', 'file': 'identifiers.tab.zip'}, 'interactions': {'url': 'http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.mitab.zip', 'file': 'interactions.mitab.zip'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)
Parameters:limit
Returns:
dipper.sources.CTD module
class dipper.sources.CTD.CTD(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The Comparative Toxicogenomics Database (CTD) includes curated data describing cross-species chemical–gene/protein interactions and chemical– and gene–disease associations to illuminate molecular mechanisms underlying variable susceptibility and environmentally influenced diseases.

Here, we fetch, parse, and convert data from CTD into triples, leveraging only the associations based on DIRECT evidence (not using the inferred associations). We currently process the following associations:

  • chemical-disease
  • gene-pathway
  • gene-disease

CTD curates relationships between genes and chemicals/diseases with marker/mechanism and/or therapeutic. Unfortunately, we cannot disambiguate between marker (gene expression) and mechanism (causation) for these associations. Therefore, we are left to relate these simply by “marker”.

CTD also pulls in genes and pathway membership from KEGG and REACTOME. We create groups of these following the pattern that the specific pathway is a subclass of ‘cellular process’ (a go process), and the gene is “involved in” that process.

For diseases, we preferentially use OMIM identifiers when they can be used uniquely over MESH. Otherwise, we use MESH ids.

Note that we scrub the following identifiers and their associated data:

  • REACT:REACT_116125 - generic disease class
  • MESH:D004283 - dog diseases
  • MESH:D004195 - disease models, animal
  • MESH:D030342 - genetic diseases, inborn
  • MESH:D040181 - genetic diseases, x-linked
  • MESH:D020022 - genetic predisposition to a disease

fetch(is_dl_forced=False)

Override Source.fetch(). Fetches resources from CTD using the CTD.files dictionary.

Parameters:is_dl_forced – bool, force download
Returns:None

files = {'chemical_disease_interactions': {'url': 'http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz', 'file': 'CTD_chemicals_diseases.tsv.gz'}, 'gene_disease': {'url': 'http://ctdbase.org/reports/CTD_genes_diseases.tsv.gz', 'file': 'CTD_genes_diseases.tsv.gz'}, 'gene_pathway': {'url': 'http://ctdbase.org/reports/CTD_genes_pathways.tsv.gz', 'file': 'CTD_genes_pathways.tsv.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

Override Source.parse(). Parses version and interaction information from CTD.

Parameters:limit – int, optional; limit the number of rows processed
Returns:None

static_files = {'publications': {'file': 'CTD_curated_references.tsv'}}
dipper.sources.ClinVar module
class dipper.sources.ClinVar.ClinVar(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)

Bases: dipper.sources.Source.Source

ClinVar hosts clinically relevant variants, both directly submitted and curated from the literature. We process the variant_summary file here, which is a digested version of their full XML. We add all variants (and coordinates/build) from their system.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'variant_citations': {'url': 'http://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt', 'file': 'variant_citations.txt'}, 'variant_summary': {'url': 'http://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz', 'file': 'variant_summary.txt.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

scrub()

The var_citations file has bad rows with more than 6 columns; we comment these out.

Returns:
variant_ids = [4288, 4289, 4290, 4291, 4297, 5240, 5241, 5242, 5243, 5244, 5245, 5246, 7105, 8877, 9295, 9296, 9297, 9298, 9449, 10072, 10361, 10382, 12528, 12529, 12530, 12531, 12532, 14353, 14823, 15872, 17232, 17233, 17234, 17235, 17236, 17237, 17238, 17239, 17284, 17285, 17286, 17287, 18179, 18180, 18181, 18343, 18363, 31951, 37123, 38562, 94060, 98004, 98005, 98006, 98008, 98009, 98194, 98195, 98196, 98197, 98198, 100055, 112885, 114372, 119244, 128714, 130558, 130559, 130560, 130561, 132146, 132147, 132148, 144375, 146588, 147536, 147814, 147936, 152976, 156327, 161457, 162000, 167132]
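
The scrub() step above only needs to neutralize rows with more than six columns. A sketch of that clean-up, assuming the raw file sits at raw/clinvar/variant_citations.txt and that “commenting out” means prefixing the line with '#':

raw_file = 'raw/clinvar/variant_citations.txt'  # assumed local path

with open(raw_file, 'rt') as infile:
    lines = infile.readlines()

with open(raw_file, 'wt') as outfile:
    for line in lines:
        # rows with more than six tab-separated columns are malformed
        if len(line.rstrip('\n').split('\t')) > 6:
            line = '# ' + line
        outfile.write(line)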
dipper.sources.ClinVarXML_alpha module
dipper.sources.Coriell module
class dipper.sources.Coriell.Coriell(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The Coriell Catalog provided to Monarch includes metadata and descriptions of NIGMS, NINDS, NHGRI, and NIA cell lines. These lines are made available for research purposes. Here, we create annotations for the cell lines as models of the diseases from which they originate.

We create a handle for a patient from which the given cell line is derived (since there may be multiple cell lines created from a given patient). A genotype is assembled for a patient, which includes a karyotype (if specified) and/or a collection of variants. Both the genotype (has_genotype) and disease are linked to the patient (has_phenotype), and the cell line is listed as derived from the patient. The cell line is classified by its [CLO cell type](http://www.ontobee.org/browser/index.php?o=clo), which itself is linked to a tissue of origin.

Unfortunately, the OMIM numbers listed in this file are both for genes and diseases; we have no way of knowing a priori if a designated OMIM number is a gene or disease, so we presently link the patient to any OMIM id via the has_phenotype relationship.

Notice: The Coriell catalog is delivered to Monarch in a specific format, and requires ssh rsa fingerprint identification. Other groups wishing to get this data in its raw form will need to contact Coriell for credentials, which must be placed into your configuration file for this source to work.

fetch(is_dl_forced=False)

Here we connect to the coriell sftp server using private connection details. They dump bi-weekly files with a timestamp in the filename. For each catalog, we poll the remote site and pull the most-recently updated file, renaming it to our local latest.csv.

Be sure to have the sftp connection details in your conf.json file, like: dbauth : {"coriell" : {"user" : "<username>", "password" : "<password>", "host" : "<host>", "private_key" : "path/to/rsa_key"}}

Parameters:is_dl_forced
Returns:
files = {'NHGRI': {'label': 'NHGRI Sample Repository for Human Genetic Research', 'page': 'https://catalog.coriell.org/1/NHGRI', 'id': 'NHGRI', 'file': 'NHGRI.csv'}, 'NIA': {'label': 'NIA Aging Cell Repository', 'page': 'https://catalog.coriell.org/1/NIA', 'id': 'NIA', 'file': 'NIA.csv'}, 'NIGMS': {'label': 'NIGMS Human Genetic Cell Repository', 'page': 'https://catalog.coriell.org/1/NIGMS', 'id': 'NIGMS', 'file': 'NIGMS.csv'}, 'NINDS': {'label': 'NINDS Human Genetics DNA and Cell line Repository', 'page': 'https://catalog.coriell.org/1/NINDS', 'id': 'NINDS', 'file': 'NINDS.csv'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

terms = {'age': 'EFO:0000246', 'cell_line_repository': 'CLO:0000008', 'collection': 'ERO:0002190', 'ethnic_group': 'EFO:0001799', 'race': 'SIO:001015', 'sampling_time': 'EFO:0000689'}
test_lines = ['ND02380', 'ND02381', 'ND02383', 'ND02384', 'GM17897', 'GM17898', 'GM17896', 'GM17944', 'GM17945', 'ND00055', 'ND00094', 'ND00136', 'GM17940', 'GM17939', 'GM20567', 'AG02506', 'AG04407', 'AG07602AG07601', 'GM19700', 'GM19701', 'GM19702', 'GM00324', 'GM00325', 'GM00142', 'NA17944', 'AG02505', 'GM01602', 'GM02455', 'AG00364', 'GM13707', 'AG00780']
dipper.sources.Decipher module
class dipper.sources.Decipher.Decipher(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The Decipher group curates and assembles the Development Disorder Genotype Phenotype Database (DDG2P) which is a curated list of genes reported to be associated with developmental disorders, compiled by clinicians as part of the DDD study to facilitate clinical feedback of likely causal variants.

Beware that the redistribution of this data is a bit unclear from the [license](https://decipher.sanger.ac.uk/legal). If you intend to distribute this data, be sure to have the appropriate licenses in place.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'annot': {'url': 'https://decipher.sanger.ac.uk/files/ddd/ddg2p.zip', 'file': 'ddg2p.zip'}}
make_allele_by_consequence(consequence, gene_id, gene_symbol)

Given a “consequence” label that describes a variation type, create an anonymous variant of the specified gene as an instance of that consequence type.

Parameters:
  • consequence
  • gene_id
  • gene_symbol
Returns:

allele_id

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

dipper.sources.EOM module
class dipper.sources.EOM.EOM(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.PostgreSQLSource.PostgreSQLSource

Elements of Morphology is a resource from NHGRI that has definitions of morphological abnormalities, together with image depictions. We pull those relationships, as well as our local mapping of equivalences between EOM and HP terminologies.

The website is crawled monthly by NIF’s DISCO crawler system, which we utilize here. Be sure to have pg user/password connection details in your conf.json file, like: dbauth : {‘disco’ : {‘user’ : ‘<username>’, ‘password’ : ‘<password>’}}

Monarch-curated data for the HP to EOM mapping is stored at https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/src/mappings/hp-to-eom-mapping.tsv

Since this resource is so small, the entirety of it is the “test” set.

fetch(is_dl_forced=False)

create the connection details for DISCO

files = {'map': {'url': 'https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/src/mappings/hp-to-eom-mapping.tsv', 'file': 'hp-to-eom-mapping.tsv'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

Override Source.parse inherited via PostgreSQLSource

tables = ['dvp.pr_nlx_157874_1']
dipper.sources.Ensembl module
class dipper.sources.Ensembl.Ensembl(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)

Bases: dipper.sources.Source.Source

This is the processing module for Ensembl.

It only includes methods to acquire the equivalences between NCBIGene and ENSG ids using ENSEMBL’s Biomart services.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

fetch_protein_gene_map(taxon_id)

Fetch a protein-to-gene map for a species in biomart :param taxon_id: :return: dict

fetch_protein_list(taxon_id)

Fetch a list of proteins for a species in biomart :param taxid: :return: list

fetch_uniprot_gene_map(taxon_id)

Fetch a dict of uniprot-gene for a species in biomart :param taxid: :return: dict

files = {'10090': {'file': 'ensembl_10090.txt'}, '10116': {'file': 'ensembl_10116.txt'}, '13616': {'file': 'ensembl_13616.txt'}, '28377': {'file': 'ensembl_28377.txt'}, '31033': {'file': 'ensembl_31033.txt'}, '3702': {'file': 'ensembl_3702.txt'}, '44689': {'file': 'ensembl_44689.txt'}, '4896': {'file': 'ensembl_4896.txt'}, '4932': {'file': 'ensembl_4932.txt'}, '6239': {'file': 'ensembl_6239.txt'}, '7227': {'file': 'ensembl_7227.txt'}, '7955': {'file': 'ensembl_7955.txt'}, '8364': {'file': 'ensembl_8364.txt'}, '9031': {'file': 'ensembl_9031.txt'}, '9258': {'file': 'ensembl_9258.txt'}, '9544': {'file': 'ensembl_9544.txt'}, '9606': {'file': 'ensembl_9606.txt'}, '9615': {'file': 'ensembl_9615.txt'}, '9796': {'file': 'ensembl_9796.txt'}, '9823': {'file': 'ensembl_9823.txt'}, '9913': {'file': 'ensembl_9913.txt'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

dipper.sources.FlyBase module
class dipper.sources.FlyBase.FlyBase(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.PostgreSQLSource.PostgreSQLSource

This is the [Drosophila Genetics](http://www.flybase.org/) resource, from which we process genotype and phenotype data about the fruit fly. Genotypes leverage the GENO genotype model.

Here, we connect to their public database, and download a subset of tables/views to get specifically at the geno-pheno data, then iterate over the tables. We end up effectively performing joins when adding nodes to the graph. We connect using the [Direct Chado Access](http://gmod.org/wiki/Public_Chado_Databases#Direct_Chado_Access)

When running the whole set, it performs best by dumping raw triples using the flag `--format nt`.

fetch(is_dl_forced=False)
Returns:
files = {'disease_models': {'url': 'ftp://ftp.flybase.net/releases/current/precomputed_files/human_disease/allele_human_disease_model_data_fb_*.tsv.gz', 'file': 'allele_human_disease_model_data.tsv.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

We process each of the postgres tables in turn. The order of processing is important here, as we build up a hashmap of internal vs external identifiers (unique keys by type to FB id). These include allele, marker (gene), publication, strain, genotype, annotation (association), and descriptive notes. :param limit: Only parse this many lines of each table :return:

querys = {'feature': "\n SELECT feature_id, dbxref_id, organism_id, name, uniquename,\n null as residues, seqlen, md5checksum, type_id, is_analysis,\n timeaccessioned, timelastmodified\n FROM feature WHERE is_analysis = false and is_obsolete = 'f'\n ", 'feature_dbxref_WIP': ' -- 17M rows in ~2 minutes\n SELECT\n feature.name feature_name, feature.uniquename feature_id,\n organism.abbreviation abbrev, organism.genus, organism.species,\n cvterm.name frature_type, db.name db, dbxref.accession\n FROM feature_dbxref\n JOIN dbxref ON feature_dbxref.dbxref_id = dbxref.dbxref_id\n JOIN db ON dbxref.db_id = db.db_id\n JOIN feature ON feature_dbxref.feature_id = feature.feature_id\n JOIN organism ON feature.organism_id = organism.organism_id\n JOIN cvterm ON feature.type_id = cvterm.cvterm_id\n WHERE feature_dbxref.is_current = true\n AND feature.is_analysis = false\n AND feature.is_obsolete = false\n AND cvterm.is_obsolete = 0\n ;\n '}
resources = [{'outfile': 'feature_relationship', 'query': '../../resources/sql/fb/feature_relationship.sql'}, {'outfile': 'stockprop', 'query': '../../resources/sql/fb/stockprop.sql'}]
tables = ['genotype', 'feature_genotype', 'pub', 'feature_pub', 'pub_dbxref', 'feature_dbxref', 'cvterm', 'stock_genotype', 'stock', 'organism', 'organism_dbxref', 'environment', 'phenotype', 'phenstatement', 'dbxref', 'phenotype_cvterm', 'phendesc', 'environment_cvterm']
test_keys = {'allele': [29677937, 23174110, 23230960, 23123654, 23124718, 23146222, 29677936, 23174703, 11384915, 11397966, 53333044, 23189969, 3206803, 29677937, 29677934, 23256689, 23213050, 23230614, 23274987, 53323093, 40362726, 11380755, 11380754, 23121027, 44425218, 28298666], 'annot': [437783, 437784, 437785, 437786, 437789, 437796, 459885, 436779, 436780, 479826], 'feature': [11411407, 53361578, 53323094, 40377849, 40362727, 11379415, 61115970, 11380753, 44425219, 44426878, 44425220], 'gene': [23220066, 10344219, 58107328, 3132660, 23193483, 3118401, 3128715, 3128888, 23232298, 23294450, 3128626, 23255338, 8350351, 41994592, 3128715, 3128432, 3128840, 3128650, 3128654, 3128602, 3165464, 23235262, 3165510, 3153563, 23225695, 54564652, 3111381, 3111324], 'genotype': [267393, 267400, 130147, 168516, 111147, 200899, 46696, 328131, 328132, 328134, 328136, 381024, 267411, 327436, 197293, 373125, 361163, 403038], 'notes': [], 'organism': [1, 226, 456], 'pub': [359867, 327373, 153054, 153620, 370777, 154315, 345909, 365672, 366057, 11380753], 'strain': [8117, 3649, 64034, 213, 30131]}
dipper.sources.GWASCatalog module
class dipper.sources.GWASCatalog.GWASCatalog(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The NHGRI-EBI Catalog of published genome-wide association studies.

We link the variants recorded here to the curated EFO-classes using a “contributes_to” linkage because the only thing we know is that the SNPs are associated with the trait/disease, but we don’t know if it is actually causative.

Description of the GWAS catalog is here: http://www.ebi.ac.uk/gwas/docs/fileheaders#_file_headers_for_catalog_version_1_0_1

GWAS also publishes OWL files, described here: http://www.ebi.ac.uk/gwas/docs/ontology

Status: IN PROGRESS

GWASFILE = 'gwas-catalog-associations_ontology-annotated.tsv'
GWASFTP = 'ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/latest'
fetch(is_dl_forced=False)
Parameters:is_dl_forced
Returns:
files = {'catalog': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/latest/gwas-catalog-associations_ontology-annotated.tsv', 'file': 'gwas-catalog-associations_ontology-annotated.tsv'}, 'efo': {'url': 'http://www.ebi.ac.uk/efo/efo.owl', 'file': 'efo.owl'}, 'so': {'url': 'http://purl.obolibrary.org/obo/so.owl', 'file': 'so.owl'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_catalog(limit=None)
Parameters:limit
Returns:
terms = {'age': 'EFO:0000246', 'cell_line_repository': 'CLO:0000008', 'collection': 'ERO:0002190', 'ethnic_group': 'EFO:0001799', 'race': 'SIO:001015', 'sampling_time': 'EFO:0000689'}
dipper.sources.GeneOntology module
class dipper.sources.GeneOntology.GeneOntology(graph_type, are_bnodes_skolemized, tax_ids=None)

Bases: dipper.sources.Source.Source

This is the parser for the [Gene Ontology Annotations](http://www.geneontology.org), from which we process gene-process/function/subcellular location associations.

We generate the GO graph to include the following information:

  • genes
  • gene-process
  • gene-function
  • gene-location

We process only a subset of the organisms (see the taxon-keyed entries in the files attribute below).

Status: IN PROGRESS / INCOMPLETE

clean_db_prefix(db)

Here, we map the GO-style prefixes to Monarch-style prefixes that can be processed by our curie_map. :param db: :return:

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'10090': {'url': 'http://geneontology.org/gene-associations/gene_association.mgi.gz', 'file': 'gene_association.mgi.gz'}, '10116': {'url': 'http://geneontology.org/gene-associations/gene_association.rgd.gz', 'file': 'gene_association.rgd.gz'}, '4896': {'url': 'http://geneontology.org/gene-associations/gene_association.pombase.gz', 'file': 'gene_association.pombase.gz'}, '559292': {'url': 'http://geneontology.org/gene-associations/gene_association.sgd.gz', 'file': 'gene_association.sgd.gz'}, '6239': {'url': 'http://geneontology.org/gene-associations/gene_association.wb.gz', 'file': 'gene_association.wb.gz'}, '7227': {'url': 'http://geneontology.org/gene-associations/gene_association.fb.gz', 'file': 'gene_association.fb.gz'}, '7955': {'url': 'http://geneontology.org/gene-associations/gene_association.zfin.gz', 'file': 'gene_association.zfin.gz'}, '9031': {'url': 'http://geneontology.org/gene-associations/goa_chicken.gaf.gz', 'file': 'gene_association.goa_ref_chicken.gz'}, '9606': {'url': 'http://geneontology.org/gene-associations/goa_human.gaf.gz', 'file': 'gene_association.goa_ref_human.gz'}, '9615': {'url': 'http://geneontology.org/gene-associations/goa_dog.gaf.gz', 'file': 'gene_association.goa_dog.gz'}, '9823': {'url': 'http://geneontology.org/gene-associations/goa_pig.gaf.gz', 'file': 'gene_association.goa_ref_pig.gz'}, '9913': {'url': 'http://geneontology.org/gene-associations/goa_cow.gaf.gz', 'file': 'goa_cow.gaf.gz'}, 'go-references': {'url': 'http://www.geneontology.org/doc/GO.references', 'file': 'GO.references'}, 'id-map': {'url': 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz', 'file': 'idmapping_selected.tab.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

get_uniprot_entrez_id_map()
map_files = {'eco_map': 'http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt'}
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_gaf(file, limit, id_map=None, eco_map=None)
dipper.sources.GeneReviews module
class dipper.sources.GeneReviews.GeneReviews(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Here we process the GeneReviews mappings to OMIM, plus inspect the GeneReviews (html) books to pull the clinical descriptions in order to populate the definitions of the terms in the ontology. We define the GeneReviews items as classes that are either grouping classes over OMIM disease ids (gene ids are filtered out), or are made as subclasses of DOID:4 (generic disease).

Note that GeneReviews [copyright policy](http://www.ncbi.nlm.nih.gov/books/NBK138602/) (as of 2015.11.20) says:

GeneReviews® chapters are owned by the University of Washington, Seattle, © 1993-2015. Permission is hereby granted to reproduce, distribute, and translate copies of content materials provided that (i) credit for source (www.ncbi.nlm.nih.gov/books/NBK1116/) and copyright (University of Washington, Seattle) are included with each copy; (ii) a link to the original material is provided whenever the material is published elsewhere on the Web; and (iii) reproducers, distributors, and/or translators comply with this copyright notice and the GeneReviews Usage Disclaimer.

This script doesn’t pull the GeneReviews books from the NCBI Bookshelf directly; scripting this task is expressly prohibited by [NCBIBookshelf policy](http://www.ncbi.nlm.nih.gov/books/NBK45311/). However, assuming you have acquired the books (in html format) via permissible means, a parser for those books is provided here to extract the clinical descriptions to define the NBK identified classes.

create_books()
fetch(is_dl_forced=False)

We fetch GeneReviews id-label map and id-omim mapping files from NCBI. :return: None

files = {'idmap': {'url': 'http://ftp.ncbi.nih.gov/pub/GeneReviews/NBKid_shortname_OMIM.txt', 'file': 'NBKid_shortname_OMIM.txt'}, 'titles': {'url': 'http://ftp.ncbi.nih.gov/pub/GeneReviews/GRtitle_shortname_NBKid.txt', 'file': 'GRtitle_shortname_NBKid.txt'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)
Returns:None
process_nbk_html(limit)

Here we process the gene reviews books to fetch the clinical descriptions to include in the ontology. We only use books that have been acquired manually, as NCBI Bookshelf does not permit automated downloads. This parser will only process the books that are found in the `raw/genereviews/books` directory, permitting partial completion.

Parameters:limit
Returns:
dipper.sources.HGNC module
class dipper.sources.HGNC.HGNC(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)

Bases: dipper.sources.Source.Source

This is the processing module for HGNC.

We create equivalences between HGNC identifiers and ENSEMBL and NCBIGene. We also add the links to cytogenic locations for the gene features.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'genes': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt', 'file': 'hgnc_complete_set.txt'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

get_symbol_id_map()

A convenience method to create a mapping between the HGNC symbols and their identifiers. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

dipper.sources.HPOAnnotations module
class dipper.sources.HPOAnnotations.HPOAnnotations(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The [Human Phenotype Ontology](http://human-phenotype-ontology.org) group curates and assembles over 115,000 annotations to hereditary diseases using the HPO ontology. Here we create OBAN-style associations between diseases and phenotypic features, together with their evidence, and age of onset and frequency (if known). The parser currently only processes the “abnormal” annotations. Association to “remarkable normality” will be added in the near future.

We create additional associations from text mining. See info at http://pubmed-browser.human-phenotype-ontology.org/.

Also, you can read about these annotations in [PMID:26119816](http://www.ncbi.nlm.nih.gov/pubmed/26119816).

In order to properly test this class, you should have a conf.json file configured with some test ids (put your favorite ids in the config), in the structure of:

test_ids: {"disease" : ["OMIM:119600", "OMIM:120160"]}

add_common_files_to_file_list()
eco_dict = {'ICE': 'ECO:0000305', 'IEA': 'ECO:0000501', 'ITM': 'ECO:0000246', 'PCS': 'ECO:0000269', 'TAS': 'ECO:0000304'}
fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'annot': {'url': 'http://compbio.charite.de/hudson/job/hpo.annotations/lastStableBuild/artifact/misc/phenotype_annotation.tab', 'file': 'phenotype_annotation.tab'}, 'doid': {'url': 'http://purl.obolibrary.org/obo/doid.owl', 'file': 'doid.owl'}, 'version': {'url': 'http://compbio.charite.de/hudson/job/hpo.annotations/lastStableBuild/artifact/misc/data_version.txt', 'file': 'data_version.txt'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

get_common_files()

Fetch the raw hpo-annotation-data by cloning/pulling the [repository](https://github.com/monarch-initiative/hpo-annotation-data.git) These files get added to the files object, and iterated over separately. :return:

get_doid_ids_for_unpadding()

Here, we fetch the doid owl file, and get all the doids. We figure out which are not zero-padded, so we can map the DOID to the correct identifier when processing the common annotation files.

This may become obsolete when https://github.com/monarch-initiative/hpo-annotation-data/issues/84 is addressed.

Returns:
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_all_common_disease_files(limit=None)

Loop through all of the files that we previously fetched from git, creating the disease-phenotype assoc. :param limit: :return:

process_common_disease_file(raw, unpadded_doids, limit=None)

Make disease-phenotype associations. Some identifiers need clean-up:

  • DOIDs are listed as DOID-DOID: –> DOID:
  • DOIDs may be unnecessarily zero-padded; these are remapped to their non-padded equivalent.

Parameters:
  • raw
  • unpadded_doids
  • limit
Returns:

scrub()

Perform various data-scrubbing on the raw data files prior to parsing. For this resource, this currently includes revising errors in identifiers for some OMIM and PMID entries.

Returns:None
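
A sketch of the identifier clean-up described for process_common_disease_file() above, assuming unpadded_doids holds the bare numeric parts of DOIDs that are not zero-padded in doid.owl:

import re

def clean_doid(raw_id, unpadded_doids):
    """Normalize DOIDs coming from the common annotation files."""
    # DOIDs listed as DOID-DOID:nnnn become DOID:nnnn
    doid = raw_id.replace('DOID-DOID:', 'DOID:')
    # Strip unnecessary zero padding when the unpadded id is the real one
    match = re.match(r'DOID:0+(\d+)$', doid)
    if match and match.group(1) in unpadded_doids:
        doid = 'DOID:' + match.group(1)
    return doid

print(clean_doid('DOID-DOID:0001234', {'1234'}))  # -> DOID:1234 (hypothetical id)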
dipper.sources.IMPC module
class dipper.sources.IMPC.IMPC(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

From the [IMPC](http://mousephenotype.org) website: The IMPC is generating a knockout mouse strain for every protein coding gene by using the embryonic stem cell resource generated by the International Knockout Mouse Consortium (IKMC). Systematic broad-based phenotyping is performed by each IMPC center using standardized procedures found within the International Mouse Phenotyping Resource of Standardised Screens (IMPReSS) resource. Gene-to-phenotype associations are made by a versioned statistical analysis with all data freely available by this web portal and by several data download features.

Here, we pull the data and model the genotypes using GENO and the genotype-to-phenotype associations using the OBAN schema.

We use all identifiers given by the IMPC with a few exceptions:

  • For identifiers that IMPC provides but does not resolve, we instantiate them as Blank Nodes. Examples include things with the pattern of: UROALL, EUROCURATE, NULL-*.
  • We mint three identifiers (see the sketch after this description):
  1. Intrinsic genotypes not including sex, based on:
    • colony_id (ES cell line + phenotyping center)
    • strain
    • zygosity
  2. Effective genotypes that are attached to the phenotypes, based on:
    • colony_id (ES cell line + phenotyping center)
    • strain
    • zygosity
    • sex
  3. Associations, based on: effective_genotype_id + phenotype_id + phenotyping_center + pipeline_stable_id + procedure_stable_id + parameter_stable_id

We DO NOT yet add the assays as evidence for the G2P associations here. To be added in the future.
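
A hypothetical illustration of the identifier minting described above; the actual helper dipper uses is not shown in this excerpt, so the hashing scheme and the 'MONARCH:' prefix are assumptions:

import hashlib

def mint_effective_genotype_id(colony_id, strain, zygosity, sex):
    """Build a deterministic id from the fields defining an effective genotype."""
    key = '-'.join([colony_id, strain, zygosity, sex])
    return 'MONARCH:' + hashlib.md5(key.encode('utf-8')).hexdigest()[0:16]

print(mint_effective_genotype_id('JAX:colony1', 'C57BL/6NJ', 'homozygote', 'female'))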

compare_checksums()

test to see if fetched file matches checksum from ebi :return: True or False

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'all': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/impc/latest/csv/ALL_genotype_phenotype.csv.gz', 'file': 'ALL_genotype_phenotype.csv.gz'}, 'checksum': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/impc/latest/csv/checksum.md5', 'file': 'checksum.md5'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

map_files = {'impc_map': '../../resources/impc_mappings.yaml', 'impress_map': 'https://data.monarchinitiative.org/dipper/cache/impress_codes.json'}
parse(limit=None)

IMPC data is delivered in three separate csv files OR in one integrated file, each with the same file format.

Parameters:limit
Returns:
parse_checksum_file(file)

:param file :return dict

test_ids = ['MGI:109380', 'MGI:1347004', 'MGI:1353495', 'MGI:1913840', 'MGI:2144157', 'MGI:2182928', 'MGI:88456', 'MGI:96704', 'MGI:1913649', 'MGI:95639', 'MGI:1341847', 'MGI:104848', 'MGI:2442444', 'MGI:2444584', 'MGI:1916948', 'MGI:107403', 'MGI:1860086', 'MGI:1919305', 'MGI:2384936', 'MGI:88135', 'MGI:1913367', 'MGI:1916571', 'MGI:2152453', 'MGI:1098270']
dipper.sources.KEGG module
class dipper.sources.KEGG.KEGG(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'cel_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/cel', 'file': 'cel_orthologs'}, 'disease': {'url': 'http://rest.genome.jp/list/disease', 'file': 'disease'}, 'disease_gene': {'url': 'http://rest.kegg.jp/link/disease/hsa', 'file': 'disease_gene'}, 'dme_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/dme', 'file': 'dme_orthologs'}, 'dre_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/dre', 'file': 'dre_orthologs'}, 'hsa_gene2pathway': {'url': 'http://rest.kegg.jp/link/pathway/hsa', 'file': 'human_gene2pathway'}, 'hsa_genes': {'url': 'http://rest.genome.jp/list/hsa', 'file': 'hsa_genes'}, 'hsa_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/hsa', 'file': 'hsa_orthologs'}, 'mmu_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/mmu', 'file': 'mmu_orthologs'}, 'ncbi': {'url': 'http://rest.genome.jp/conv/ncbi-geneid/hsa', 'file': 'ncbi'}, 'omim2disease': {'url': 'http://rest.genome.jp/link/disease/omim', 'file': 'omim2disease'}, 'omim2gene': {'url': 'http://rest.genome.jp/link/omim/hsa', 'file': 'omim2gene'}, 'ortholog_classes': {'url': 'http://rest.genome.jp/list/orthology', 'file': 'ortholog_classes'}, 'pathway': {'url': 'http://rest.genome.jp/list/pathway', 'file': 'pathway'}, 'pathway_disease': {'url': 'http://rest.kegg.jp/link/pathway/ds', 'file': 'pathway_disease'}, 'pathway_ko': {'url': 'http://rest.kegg.jp/link/pathway/ko', 'file': 'pathway_ko'}, 'pathway_pubmed': {'url': 'http://rest.kegg.jp/link/pathway/pubmed', 'file': 'pathway_pubmed'}, 'rno_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/rno', 'file': 'rno_orthologs'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)
Parameters:limit
Returns:
test_ids = {'disease': ['ds:H00015', 'ds:H00026', 'ds:H00712', 'ds:H00736', 'ds:H00014'], 'genes': ['hsa:100506275', 'hsa:285958', 'hsa:286410', 'hsa:6387', 'hsa:1080', 'hsa:11200', 'hsa:1131', 'hsa:1137', 'hsa:126', 'hsa:1277', 'hsa:1278', 'hsa:1285', 'hsa:1548', 'hsa:1636', 'hsa:1639', 'hsa:183', 'hsa:185', 'hsa:1910', 'hsa:207', 'hsa:2099', 'hsa:2483', 'hsa:2539', 'hsa:2629', 'hsa:2697', 'hsa:3161', 'hsa:3845', 'hsa:4137', 'hsa:4591', 'hsa:472', 'hsa:4744', 'hsa:4835', 'hsa:4929', 'hsa:5002', 'hsa:5080', 'hsa:5245', 'hsa:5290', 'hsa:53630', 'hsa:5630', 'hsa:5663', 'hsa:580', 'hsa:5888', 'hsa:5972', 'hsa:6311', 'hsa:64327', 'hsa:6531', 'hsa:6647', 'hsa:672', 'hsa:675', 'hsa:6908', 'hsa:7040', 'hsa:7045', 'hsa:7048', 'hsa:7157', 'hsa:7251', 'hsa:7490', 'hsa:7517', 'hsa:79728', 'hsa:83893', 'hsa:83990', 'hsa:841', 'hsa:8438', 'hsa:8493', 'hsa:860', 'hsa:9568', 'hsa:9627', 'hsa:9821', 'hsa:999', 'hsa:3460'], 'orthology_classes': ['ko:K00010', 'ko:K00027', 'ko:K00042', 'ko:K00088'], 'pathway': ['path:map00010', 'path:map00195', 'path:map00100', 'path:map00340', 'path:hsa05223']}
dipper.sources.MGI module
class dipper.sources.MGI.MGI(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.PostgreSQLSource.PostgreSQLSource

This is the [Mouse Genome Informatics](http://www.informatics.jax.org/) resource, from which we process genotype and phenotype data about laboratory mice. Genotypes leverage the GENO genotype model.

Here, we connect to their public database, and download a subset of tables/views to get specifically at the geno-pheno data, then iterate over the tables. We end up effectively performing joins when adding nodes to the graph. In order to use this parser, you will need to have user/password connection details in your conf.json file, like: dbauth : {‘mgi’ : {‘user’ : ‘<username>’, ‘password’ : ‘<password>’}}. You can request access by contacting mgi-help@jax.org.
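
A sketch of reading those credentials and opening a direct connection with psycopg2; the host and database values are placeholders, and dipper's own connection handling is not shown here:

import json
import psycopg2

with open('conf.json') as conf_file:
    conf = json.load(conf_file)

creds = conf['dbauth']['mgi']
connection = psycopg2.connect(
    host='<host>',          # placeholder
    database='<database>',  # placeholder
    user=creds['user'],
    password=creds['password'])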

fetch(is_dl_forced=False)

For the MGI resource, we connect to the remote database, and pull the tables into local files. We’ll check the local table versions against the remote version :return:

fetch_transgene_genes_from_db(cxn)

This is a custom query to fetch the non-mouse genes that are part of transgene alleles.

Parameters:cxn
Returns:
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

We process each of the postgres tables in turn. The order of processing is important here, as we build up a hashmap of internal vs external identifiers (unique keys by type to MGI id). These include allele, marker (gene), publication, strain, genotype, annotation (association), and descriptive notes. :param limit: Only parse this many lines of each table :return:

process_mgi_note_allele_view(limit=None)

These are the descriptive notes about the alleles. Note that these notes have embedded HTML - should we do anything about that? :param limit: :return:

process_mgi_relationship_transgene_genes(limit=None)

Here, we have the relationship between MGI transgene alleles, and the non-mouse gene ids that are part of them. We augment the allele with the transgene parts.

Parameters:limit
Returns:
resources = [{'Force': True, 'outfile': 'mgi_dbinfo', 'query': '../../resources/sql/mgi/mgi_dbinfo.sql'}, {'outfile': 'gxd_genotype_view', 'query': '../../resources/sql/mgi/gxd_genotype_view.sql'}, {'outfile': 'gxd_genotype_summary_view', 'query': '../../resources/sql/mgi/gxd_genotype_summary_view.sql'}, {'outfile': 'gxd_allelepair_view', 'query': '../../resources/sql/mgi/gxd_allelepair_view.sql'}, {'outfile': 'all_summary_view', 'query': '../../resources/sql/mgi/all_summary_view.sql'}, {'outfile': 'all_allele_view', 'query': '../../resources/sql/mgi/all_allele_view.sql'}, {'outfile': 'all_allele_mutation_view', 'query': '../../resources/sql/mgi/all_allele_mutation_view.sql'}, {'outfile': 'mrk_marker_view', 'query': '../../resources/sql/mgi/mrk_marker_view.sql'}, {'outfile': 'voc_annot_view', 'query': '../../resources/sql/mgi/voc_annot_view.sql'}, {'outfile': 'evidence_view', 'query': '../../resources/sql/mgi/evidence.sql'}, {'outfile': 'bib_acc_view', 'query': '../../resources/sql/mgi/bib_acc_view.sql'}, {'outfile': 'prb_strain_view', 'query': '../../resources/sql/mgi/prb_strain_view.sql'}, {'outfile': 'mrk_summary_view', 'query': '../../resources/sql/mgi/mrk_summary_view.sql'}, {'outfile': 'mrk_acc_view', 'query': '../../resources/sql/mgi/mrk_acc_view.sql'}, {'outfile': 'prb_strain_acc_view', 'query': '../../resources/sql/mgi/prb_strain_acc_view.sql'}, {'outfile': 'prb_strain_genotype_view', 'query': '../../resources/sql/mgi/prb_strain_genotype_view.sql'}, {'outfile': 'mgi_note_vocevidence_view', 'query': '../../resources/sql/mgi/mgi_note_vocevidence_view.sql'}, {'outfile': 'mgi_note_allele_view', 'query': '../../resources/sql/mgi/mgi_note_allele_view.sql'}, {'outfile': 'mrk_location_cache', 'query': '../../resources/sql/mgi/mrk_location_cache.sql'}]
test_keys = {'allele': [1612, 1609, 1303, 56760, 816699, 51074, 14595, 816707, 246, 38139, 4334, 817387, 8567, 476, 42885, 3658, 1193, 6978, 6598, 16698, 626329, 33649, 835532, 7861, 33649, 6308, 1285, 827608], 'annot': [6778, 12035, 189442, 189443, 189444, 189445, 189446, 189447, 189448, 189449, 189450, 189451, 189452, 318424, 717023, 717024, 717025, 717026, 717027, 717028, 717029, 5123647, 928426, 5647502, 6173775, 6173778, 6173780, 6173781, 6620086, 13487622, 13487623, 13487624, 23241933, 23534428, 23535949, 23546035, 24722398, 29645663, 29645664, 29645665, 29645666, 29645667, 29645682, 43803707, 43804057, 43805682, 43815003, 43838073, 58485679, 59357863, 59357864, 59357865, 59357866, 59357867, 60448185, 60448186, 60448187, 62628962, 69611011, 69611253, 79642481, 79655585, 80436328, 83942519, 84201418, 90942381, 90942382, 90942384, 90942385, 90942386, 90942389, 90942390, 90942391, 90942392, 92947717, 92947729, 92947735, 92947757, 92948169, 92948441, 92948518, 92949200, 92949301, 93092368, 93092369, 93092370, 93092371, 93092372, 93092373, 93092374, 93092375, 93092376, 93092377, 93092378, 93092379, 93092380, 93092381, 93092382, 93401080, 93419639, 93436973, 93436974, 93436975, 93436976, 93436977, 93459094, 93459095, 93459096, 93459097, 93484431, 93484432, 93491333, 93491334, 93491335, 93491336, 93491337, 93510296, 93510297, 93510298, 93510299, 93510300, 93548463, 93551440, 93552054, 93576058, 93579091, 93579870, 93581813, 93581832, 93581841, 93581890, 93583073, 93583786, 93584586, 93587213, 93604448, 93607816, 93613038, 93614265, 93618579, 93620355, 93621390, 93624755, 93626409, 93626918, 93636629, 93642680, 93643814, 93643825, 93647695, 93648755, 93652704, 5123647, 71668107, 71668108, 71668109, 71668110, 71668111, 71668112, 71668113, 71668114, 74136778, 107386012, 58485691], 'genotype': [81, 87, 142, 206, 281, 283, 286, 287, 341, 350, 384, 406, 407, 411, 425, 457, 458, 461, 476, 485, 537, 546, 551, 553, 11702, 12910, 13407, 13453, 14815, 26655, 28610, 37313, 38345, 59766, 60082, 65406, 64235], 'marker': [357, 38043, 305574, 444020, 34578, 9503, 38712, 17679, 445717, 38415, 12944, 377, 77197, 18436, 30157, 14252, 412465, 38598, 185833, 35408, 118781, 37270, 31169, 25040, 81079], 'notes': [5114, 53310, 53311, 53312, 53313, 53314, 53315, 53316, 53317, 53318, 53319, 53320, 71099, 501751, 501752, 501753, 501754, 501755, 501756, 501757, 744108, 1055341, 6049949, 6621213, 6621216, 6621218, 6621219, 7108498, 14590363, 14590364, 14590365, 25123358, 25123360, 26688159, 32028545, 32028546, 32028547, 32028548, 32028549, 32028564, 37833486, 47742903, 47743253, 47744878, 47754199, 47777269, 65105483, 66144014, 66144015, 66144016, 66144017, 66144018, 70046116, 78382808, 78383050, 103920312, 103920318, 103920319, 103920320, 103920322, 103920323, 103920324, 103920325, 103920326, 103920328, 103920330, 103920331, 103920332, 103920333, 106390006, 106390018, 106390024, 106390046, 106390458, 106390730, 106390807, 106391489, 106391590, 106579450, 106579451, 106579452, 106579453, 106579454, 106579455, 106579456, 106579457, 106579458, 106579459, 106579460, 106579461, 106579462, 106579463, 106579464, 106949909, 106949910, 106969368, 106969369, 106996040, 106996041, 106996042, 106996043, 106996044, 107022123, 107022124, 107022125, 107022126, 107052057, 107052058, 107058959, 107058960, 107058961, 107058962, 107058963, 107077922, 107077923, 107077924, 107077925, 107077926, 107116089, 107119066, 107119680, 107154485, 107155254, 107158128, 107159385, 107160435, 107163154, 107163183, 107163196, 107163271, 107164877, 
107165872, 107166942, 107168838, 107170557, 107174867, 107194346, 107198590, 107205179, 107206725, 107212120, 107214364, 107214911, 107215700, 107218519, 107218642, 107219974, 107221415, 107222064, 107222717, 107235068, 107237686, 107242709, 107244121, 107244139, 107248964, 107249091, 107250401, 107251870, 107255383, 107256603], 'pub': [73197, 165659, 134151, 76922, 181903, 26681, 128938, 80054, 156949, 159965, 53672, 170462, 206876, 87798, 100777, 176693, 139205, 73199, 74017, 102010, 152095, 18062, 216614, 61933, 13385, 32366, 114625, 182408, 140802], 'strain': [30639, 33832, 33875, 33940, 36012, 59504, 34338, 34382, 47670, 59802, 33946, 31421, 64, 40, 14, -2, 30639, 15975, 35077, 12610, -1, 28319, 27026, 141, 62299]}
dipper.sources.MGISlim module
class dipper.sources.MGISlim.MGISlim(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Slim MGI model containing only gene-to-phenotype associations. Uses MouseMine: http://www.mousemine.org/mousemine/begin.do

fetch(is_dl_forced)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

dipper.sources.MMRRC module
class dipper.sources.MMRRC.MMRRC(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Here we process the Mutant Mouse Resource and Research Center (https://www.mmrrc.org) strain data, which includes:

  • strains and their mutant alleles
  • phenotypes of the alleles
  • descriptions of the research uses of the strains

Note that some gene identifiers are not included (for many of the transgenics with human genes) in the raw data. We do our best to process the links between the variant and the affected gene, but sometimes the mapping is not clear, and we do not include it. Many of these details will be solved by merging this source with the MGI data source, which has the variant-to-gene designations.

Also note that even though the strain pages at the MMRRC site do list phenotypic differences in the context of the strain backgrounds, they do not provide that data to us, and thus we cannot supply that disambiguation here.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'catalog': {'url': 'https://www.mmrrc.org/about/mmrrc_catalog_data.csv', 'file': 'mmrrc_catalog_data.csv'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

test_ids = ['MMRRC:037507-MU', 'MMRRC:041175-UCD', 'MMRRC:036933-UNC', 'MMRRC:037884-UCD', 'MMRRC:000255-MU', 'MMRRC:037372-UCD', 'MMRRC:000001-UNC']
dipper.sources.MPD module
class dipper.sources.MPD.MPD(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

From the [MPD](http://phenome.jax.org/) website: This resource is a collaborative standardized collection of measured data on laboratory mouse strains and populations. Includes baseline phenotype data sets as well as studies of drug, diet, disease and aging effect. Also includes protocols, projects and publications, and SNP, variation and gene expression studies.

Here, we pull the data and model the genotypes using GENO and the genotype-to-phenotype associations using the OBAN schema.

MPD provides measurements for particular assays for several strains. Each of these measurements is itself mapped to an MP or VT term as a phenotype. Therefore, we can create a strain-to-phenotype association for those strains that lie outside of the “normal” range for the given measurements. We compute the average of the measurements for all strains tested, and then flag any extreme measurements lying beyond some threshold from that average.

Our default threshold is +/- 2 standard deviations from the mean.

Because the measurements are made and recorded at the level of a specific sex of each strain, we associate the MP/VT phenotype with the sex-qualified genotype/strain.
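
A minimal sketch of the thresholding idea (not the actual MPD parser): compute the mean and standard deviation across strain means for one measurement, then keep strains whose means lie beyond two standard deviations:

from statistics import mean, stdev

def extreme_strains(strain_means, threshold=2.0):
    """Return strain -> z-score for strains beyond `threshold` standard deviations."""
    values = list(strain_means.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return {}
    return {strain: round((value - mu) / sigma, 2)
            for strain, value in strain_means.items()
            if abs(value - mu) > threshold * sigma}

# hypothetical strain means for one sex-specific measurement
print(extreme_strains({'A/J': 9.8, 'BALB/cJ': 10.0, 'C57BL/6J': 10.1,
                       'C3H/HeJ': 10.2, 'DBA/2J': 10.3, 'FVB/NJ': 9.9,
                       'NZO/HlLtJ': 15.6}))  # flags NZO/HlLtJ as an outlier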

MPDDL = 'http://phenomedoc.jax.org/MPD_downloads'
static build_measurement_description(row)
static check_header(filename, header)
fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'assay_metadata': {'url': 'http://phenomedoc.jax.org/MPD_downloads/measurements.csv', 'file': 'measurements.csv'}, 'ontology_mappings': {'url': 'http://phenomedoc.jax.org/MPD_downloads/ontology_mappings.csv', 'file': 'ontology_mappings.csv'}, 'straininfo': {'url': 'http://phenomedoc.jax.org/MPD_downloads/straininfo.csv', 'file': 'straininfo.csv'}, 'strainmeans': {'url': 'http://phenomedoc.jax.org/MPD_downloads/strainmeans.csv.gz', 'file': 'strainmeans.csv.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

mgd_agent_id = 'MPD:db/q?rtn=people/allinv'
mgd_agent_label = 'Mouse Phenotype Database'
mgd_agent_type = 'foaf:organization'
static normalise_units(units)
parse(limit=None)

MPD data is delivered in four separate csv files and one xml file, which we process iteratively and write out as one large graph.

Parameters:limit
Returns:
test_ids = ['MPD:6', 'MPD:849', 'MPD:425', 'MPD:569', 'MPD:10', 'MPD:1002', 'MPD:39', 'MPD:2319']
dipper.sources.Monarch module
class dipper.sources.Monarch.Monarch(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

This is the parser for data curated by the [Monarch Initiative](https://monarchinitiative.org). Data is currently maintained in a private repository, soon to be released.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_omia_phenotypes(limit)
dipper.sources.Monochrom module
class dipper.sources.Monochrom.Monochrom(graph_type, are_bnodes_skolemized, tax_ids=None)

Bases: dipper.sources.Source.Source

This class will leverage the GENO ontology and modeling patterns to build an ontology of chromosomes for any species. These classes represent major structural pieces of Chromosomes which are often universally referenced, using physical properties/observations that remain constant over different genome builds (such as banding patterns and arms). The idea is to create a scaffold upon which we can hang build-specific chromosomal coordinates, and reason across them.

In general, this will take the cytogenetic band files from UCSC, and create missing grouping classes, in order to build the partonomy from a very specific chromosomal band up through the chromosome itself and enable overlap and containment queries. We use RO:subsequence_of as our relationship between nested chromosomal parts. For example, 13q21.31 ==> 13q21.31, 13q21.3, 13q21, 13q2, 13q, 13

At the moment, this only computes the bands for Human, Mouse, Zebrafish, and Rat but will be expanding in the future as needed.

Because this is a universal framework to represent the chromosomal structure of any species, we must mint identifiers for each chromosome and part. We differentiate species by first creating a species-specific genome, then for each species-specific chromosome we include the NCBI taxon number together with the chromosome number, like: `<species number>chr<num><band>`. For 13q21.31, this would be 9606chr13q21.31. We then create triples for a given band like:

CHR:9606chr1p36.33 rdf:type SO:chromosome_band
CHR:9606chr1p36.33 subsequence_of CHR:9606chr1p36.3

where any band in the file is an instance of a chr_band (or a more specific type) and is a subsequence of its containing region.

We determine the containing regions of a band by parsing the band string; since each alphanumeric character is a significant “place”, we can split it, with the shorter strings being parents of the longer string.

Since this is small, and we have not limited other items in our test set to a small region, we simply use the whole graph (genome) for testing purposes, and copy the main graph to the test graph.

Since this Dipper class is building an ONTOLOGY, rather than instance-level data, we must also include domain and range constraints, and other owl-isms.

TODO: any species by commandline argument

We are currently mapping these to the CHR idspace, but this is NOT YET APPROVED and is subject to change.

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'10090': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/cytoBandIdeo.txt.gz', 'build_num': 'mm10', 'genome_label': 'Mouse', 'file': '10090cytoBand.txt.gz'}, '10116': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/rn6/database/cytoBandIdeo.txt.gz', 'build_num': 'rn6', 'genome_label': 'Rat', 'file': '10116cytoBand.txt.gz'}, '7955': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/danRer10/database/cytoBandIdeo.txt.gz', 'build_num': 'danRer10', 'genome_label': 'Zebrafish', 'file': '7955cytoBand.txt.gz'}, '9031': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/galGal4/database/cytoBandIdeo.txt.gz', 'build_num': 'galGal4', 'genome_label': 'chicken', 'file': 'galGal4cytoBand.txt.gz'}, '9606': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz', 'build_num': 'hg19', 'genome_label': 'Human', 'file': '9606cytoBand.txt.gz'}, '9796': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/equCab2/database/cytoBandIdeo.txt.gz', 'build_num': 'equCab2', 'genome_label': 'horse', 'file': 'equCab2cytoBand.txt.gz'}, '9823': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/cytoBandIdeo.txt.gz', 'build_num': 'susScr3', 'genome_label': 'pig', 'file': 'susScr3cytoBand.txt.gz'}, '9913': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/cytoBandIdeo.txt.gz', 'build_num': 'bosTau7', 'genome_label': 'cow', 'file': 'bosTau7cytoBand.txt.gz'}, '9940': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/oviAri3/database/cytoBandIdeo.txt.gz', 'build_num': 'oviAri3', 'genome_label': 'sheep', 'file': 'oviAri3cytoBand.txt.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

make_parent_bands(band, child_bands)

This will recursively determine the grouping bands that a given band belongs to, e.g. 13q21.31 ==> 13, 13q, 13q2, 13q21, 13q21.3, 13q21.31

Parameters:
  • band
  • child_bands
Returns:

map_type_of_region(regiontype)

Note that “stalk” refers to the short arm of acrocentric chromosomes chr13,14,15,21,22 for human. :param regiontype: :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

region_type_map = {'acen': 'SO:0000577', 'chromosome': 'SO:0000340', 'chromosome_arm': 'SO:0000105', 'chromosome_band': 'SO:0000341', 'chromosome_part': 'SO:0000830', 'gneg': 'SO:0000341', 'gpos100': 'SO:0000341', 'gpos25': 'SO:0000341', 'gpos33': 'SO:0000341', 'gpos50': 'SO:0000341', 'gpos66': 'SO:0000341', 'gpos75': 'SO:0000341', 'gvar': 'SO:0000341', 'stalk': 'SO:0000341'}
dipper.sources.Monochrom.getChrPartTypeByNotation(notation)

This method will figure out the kind of feature that a given band is based on pattern matching to standard karyotype notation. (e.g. 13q22.2 ==> chromosome sub-band)

This has been validated against human, mouse, fish, and rat nomenclature. :param notation: the band (without the chromosome prefix) :return:
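
A hedged sketch of this kind of pattern matching; the SO terms are taken from region_type_map above, and the real rules in the source may differ:

import re

def chr_part_type_by_notation(notation):
    # notation is the band without the chromosome prefix, e.g. '', 'q', 'q22.2'
    if notation == '':
        return 'SO:0000340'        # a bare chromosome
    if re.fullmatch(r'[pq]', notation):
        return 'SO:0000105'        # a chromosome arm, e.g. 'q'
    if re.fullmatch(r'[pq][0-9.]+', notation):
        return 'SO:0000341'        # a chromosome (sub)band, e.g. 'q22.2'
    return 'SO:0000830'            # otherwise a generic chromosome part

print(chr_part_type_by_notation('q22.2'))   # SO:0000341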

dipper.sources.MyChem module
class dipper.sources.MyChem.MyChem(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

static add_relation(results, relation)
static check_uniprot(target_dict)
static chunks(l, n)

Yield successive n-sized chunks from l.
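
This is the familiar slicing recipe; a minimal sketch of the same behaviour:

def chunks(lst, n):
    # yield successive n-sized chunks from lst
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

list(chunks([1, 2, 3, 4, 5, 6, 7], 3))   # [[1, 2, 3], [4, 5, 6], [7]]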

static execute_query(query)
fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

fetch_from_mychem()
static format_actions(target_dict)
static get_drug_record(ids, fields)
static get_inchikeys()
make_triples(source, package)
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

static return_target_list(targ_in)
dipper.sources.MyDrug module
class dipper.sources.MyDrug.MyDrug(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Drugs and Compounds stored in the BioThings database

MY_DRUG_API = 'http://c.biothings.io/v1/query'
checkIfRemoteIsNewer(localfile)

Need to figure out how BioThings records releases; for now, if the file exists we will assume it is a fully downloaded cache. :param localfile: str file path :return: boolean True if remote file is newer else False

fetch(is_dl_forced=False)

Note there is an unpublished mydrug client that works like this:

from mydrug import MyDrugInfo
md = MyDrugInfo()
r = list(md.query('_exists_:aeolus', fetch_all=True))

Parameters:is_dl_forced – boolean, force download
Returns:
files = {'aeolus': {'file': 'aeolus.json'}}
parse(limit=None, or_limit=1)

Parse mydrug files :param limit: int limit json docs processed :param or_limit: int odds ratio limit :return: None

dipper.sources.NCBIGene module
class dipper.sources.NCBIGene.NCBIGene(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)

Bases: dipper.sources.Source.Source

This is the processing module for the National Center for Biotechnology Information. It includes parsers for the gene_info (gene names, symbols, ids, equivalent ids), gene history (alt ids), and gene2pubmed publication references about a gene.

This creates Genes as classes, when they are properly typed as such. For those entries where it is an ‘unknown significance’, it is added simply as an instance of a sequence feature. It will add equivalentClasses for a subset of external identifiers, including: ENSEMBL, HGMD, MGI, ZFIN, and gene product links for HPRD. They are additionally located on their chromosomal band (until we process actual genomic coords in a separate file).

We process the genes from the filtered taxa, starting with those configured by default (human, mouse, fish). This can be overridden in the calling script to include additional taxa, if desired. The gene ids in the conf.json will be used to subset the data when testing.

All entries in the gene_history file are added as deprecated classes, and linked to the current gene id, with “replaced_by” relationships.

Since we do not know much about the specific link in gene2pubmed, we simply create a “mentions” relationship.

SCIGRAPH_BASE = 'https://scigraph-ontology-dev.monarchinitiative.org/scigraph/graph/'
add_orthologs_by_gene_group(graph, gene_ids)

This will get orthologies between human and other vertebrate genomes based on the gene_group annotation pipeline from NCBI. More information can be learned here: http://www.ncbi.nlm.nih.gov/news/03-13-2014-gene-provides-orthologs-regions/ The method for associations is described in [PMCID:3882889](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3882889/) == [PMID:24063302](http://www.ncbi.nlm.nih.gov/pubmed/24063302/). Because these are only between human and vertebrate genomes, they will certainly miss out on very distant orthologies, and should not be considered complete.

We do not run this within the NCBI parser itself; rather, it is a convenience function for other parsers to call.

Parameters:
  • graph
  • gene_ids – Gene ids to fetch the orthology
Returns:

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'gene2pubmed': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz', 'file': 'gene2pubmed.gz'}, 'gene_group': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_group.gz', 'file': 'gene_group.gz'}, 'gene_history': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_history.gz', 'file': 'gene_history.gz'}, 'gene_info': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz', 'file': 'gene_info.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

static map_type_of_gene(sotype)
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

resources = {'clique_leader': '../../resources/clique_leader.yaml'}
dipper.sources.OMA module
class dipper.sources.OMA.OMA(graph_type, are_bnodes_skolemized, tax_ids=None)

Bases: dipper.sources.OrthoXML.OrthoXML

BENCHMARK_BASE = 'https://omabrowser.org/ReferenceProteomes'
files = {'oma_hogs': {'url': 'https://omabrowser.org/ReferenceProteomes/OMA_GETHOGs-2_2017-04.orthoxml.gz', 'file': 'OMA_GETHOGs-2_2017-04.orthoxml.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

dipper.sources.OMIA module
class dipper.sources.OMIA.OMIA(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

This is the parser for the [Online Mendelian Inheritance in Animals (OMIA)](http://omia.angis.org.au), from which we process inherited disorders, other (single-locus) traits, and genes in >200 animal species (other than human, mouse, and rat).

We generate the omia graph to include the following information:
  • genes
  • animal taxonomy, and breeds as instances of those taxa (breeds are akin to “strains” in other taxa)
  • animal diseases, along with species-specific subtypes of those diseases
  • publications (and their mapping to PMIDs, if available)
  • gene-to-phenotype associations (via an anonymous variant-locus)
  • breed-to-phenotype associations

We make links between OMIA and OMIM in two ways:
  1. mappings between OMIA and OMIM are created as OMIA –> hasdbXref OMIM
  2. mappings between a breed and an OMIA disease are created to be a model for the mapped OMIM disease, IF AND ONLY IF it is a 1:1 mapping. There are some 1:many mappings, and these often happen if the OMIM item is a gene.

Because many of these species are not covered in the PANTHER orthology datafiles, we also pull any orthology relationships from the gene_group files from NCBI.

clean_up_omim_genes()
fetch(is_dl_forced=False)
Parameters:is_dl_forced
Returns:
files = {'data': {'url': 'http://compldb.angis.org.au/dumps/omia.xml.gz', 'file': 'omia.xml.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

make_breed_id(key)
map_omia_group_category_to_ontology_id(category_num)

Using the category number in the OMIA_groups table, map them to a disease id. This may be superseded by other MONDO methods.

Platelet disorders will be more specific once https://github.com/obophenotype/human-disease-ontology/issues/46 is fulfilled.

Parameters:category_num
Returns:
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_associations(limit)

Loop through the xml file and process the article-breed, article-phene, breed-phene, phene-gene associations, and the external links to LIDA.

Parameters:limit
Returns:
process_classes(limit)

Loop through the xml file and process the articles, breed, genes, phenes, and phenotype-grouping classes. We add elements to the graph, and store the id-to-label in the label_hash dict, along with the internal key-to-external id in the id_hash dict. The latter are referenced in the association processing functions.

Parameters:limit
Returns:
process_species(limit)

Loop through the xml file and process the species. We add elements to the graph, and store the id-to-label in the label_hash dict. :param limit: :return:

scrub()

The XML file seems to have mixed-encoding; we scrub out the control characters from the file for processing.

For example: omia.xml:1555328.28: PCDATA invalid Char value 2 <field name=”journal”>Bulletin et Memoires de la Societe Centrale de Medic

Returns:
write_molgen_report()
dipper.sources.OMIM module
class dipper.sources.OMIM.OMIM(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The only anonymously obtainable data from the ftp site is mim2gene. However, more detailed information is available via their API. So, we pull the omim identifiers from their ftp site, then query their API in batches of 20. Their prescribed rate limits have been mercurial: one per two seconds or four per second; as of November 2017, all mention of API rate limits has vanished (save 20 IDs per call if any include is used).

Note this ingest requires an api Key which is not stored in the repo, but in a separate conf.json file.

Processing this source serves two purposes: 1. the creation of the OMIM classes for merging into the disease ontology 2. add annotations such as disease-gene associations

When creating the disease classes, we pull from their REST-api id/label/definition information. Additionally we pull the Orphanet and UMLS mappings (to make equivalent ids). We also pull the phenotypic series annotations as grouping classes.

fetch(is_dl_forced=True)

Get the preconfigured static files. This DOES NOT fetch the individual records via REST…that is handled in the parsing function. (To be refactored.) Overrides Source.fetch(), calling Source.get_files(). :param is_dl_forced: :return:

files = {'all': {'url': 'https://omim.org/static/omim/data/mim2gene.txt', 'clean': 'https://data.omim.org/downloads/', 'file': 'mim2gene.txt'}, 'morbidmap': {'url': 'https://data.omim.org/downloads//morbidmap.txt', 'clean': 'https://data.omim.org/downloads/', 'file': 'morbidmap.txt'}, 'phenotypicSeries': {'url': 'https://omim.org/phenotypicSeriesTitle/all?format=tsv', 'headers': {'User-Agent': 'The Monarch Initiative (https://monarchinitiative.org/; info@monarchinitiative.org)'}, 'clean': 'https://data.omim.org/downloads/', 'file': 'phenotypic_series_title_all.txt'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_entries(omimids, transform, included_fields=None, graph=None, limit=None)

Given a list of omim ids, this will use the omim API to fetch the entries, according to the `included_fields` passed as a parameter. If a transformation function is supplied, this will iterate over each entry, and either add the results to the supplied `graph` or will return a set of processed entries that the calling function can further iterate.

If no `included_fields` are provided, this will simply fetch the basic entry from omim, which includes an entry’s: prefix, mimNumber, status, and titles.

Parameters:
  • omimids – the set of omim entry ids to fetch using their API
  • transform – Function to transform each omim entry when looping
  • included_fields – A set of what fields are required to retrieve from the API
  • graph – the graph to add the transformed data into
Returns:
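
A hedged usage sketch (assuming an OMIM API key is configured in conf.json, and that 'rdf_graph' is a valid graph_type); filter_keep_phenotype_entry_ids, documented below, serves as the transform so that only phenotype-type entries come back:

from dipper.sources.OMIM import OMIM, filter_keep_phenotype_entry_ids

omim = OMIM('rdf_graph', are_bnodes_skolemized=True)

# With no included_fields, only the basic entry (prefix, mimNumber,
# status, and titles) is fetched for each id.
phenotype_mims = omim.process_entries(
    omimids=['100070', '102560'],               # example MIM numbers
    transform=filter_keep_phenotype_entry_ids)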

test_ids = [119600, 120160, 157140, 158900, 166220, 168600, 219700, 253250, 305900, 600669, 601278, 602421, 605073, 607822, 102560, 102480, 100678, 102750, 600201, 104200, 105400, 114480, 115300, 121900, 107670, 11600, 126453, 102150, 104000, 107200, 100070, 611742, 611100, 102480]
dipper.sources.OMIM.filter_keep_phenotype_entry_ids(entry, graph=None)
dipper.sources.OMIM.get_omim_id_from_entry(entry)
dipper.sources.Orphanet module
class dipper.sources.Orphanet.Orphanet(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Orphanet’s aim is to help improve the diagnosis, care and treatment of patients with rare diseases. For Orphanet, we are currently only parsing the disease-gene associations.

Note that ???
fetch(is_dl_forced=False)
Parameters:is_dl_forced
Returns:
files = {'disease-gene': {'url': 'http://www.orphadata.org/data/xml/en_product6.xml', 'file': 'en_product6.xml'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

dipper.sources.OrthoXML module
class dipper.sources.OrthoXML.OrthoXML(graph_type, are_bnodes_skolemized, method, tax_ids=None)

Bases: dipper.sources.Source.Source

Extract the induced pairwise relations from an OrthoXML file.

This base class is primarily intended to extract the orthologous and paralogous relations from a file in OrthoXML format containing the QfO reference species data set.

A concrete class should subclass this class and overwrite the constructor method to provide the information about the dataset and a method name.

add_protein_to_graph

Adds protein nodes to the graph and adds an “in_taxon” triple.

For efficiency reasons, we cache which proteins we have already added using a least-recently-used cache.

clean_protein_id(protein_id)

makes sure protein_id is properly prefixed

extract_taxon_info(gene_node)

extract the ncbi taxon id from a gene_node

default implementation goes up to the species node in the xml and extracts the id from the attribute at that node.

fetch(is_dl_forced=False)
Returns:None
files = {}
parse(limit=None)
Returns:None
class dipper.sources.OrthoXML.OrthoXMLParser(xml)

Bases: object

default_node_list()
extract_pairwise_relations(node=None)
get_children(node)
is_internal_node(node)
is_leaf(node)
leaf_label(leaf)
dipper.sources.Panther module
class dipper.sources.Panther.Panther(graph_type, are_bnodes_skolemized, tax_ids=None)

Bases: dipper.sources.Source.Source

The pairwise orthology calls from Panther DB: http://pantherdb.org/ encompass 22 species, from the RefGenome and HCOP projects. Here, we map the orthology classes to RO homology relationships. This resource may be extended in the future with additional species.

This currently makes a graph of orthologous relationships between genes, with the assumption that gene metadata (labels, equivalent ids) are provided from other sources.

Gene families are nominally created from the orthology files, though these are incomplete with no hierarchical (subfamily) information. This will get updated from the HMM files in the future.

Note that there is a fair amount of identifier cleanup performed to align with our standard CURIE prefixes.

The test graph of data is output based on configured “protein” identifiers in conf.json.

By default, this will produce a file with ALL orthologous relationships. IF YOU WANT ONLY A SUBSET, YOU NEED TO PROVIDE A FILTER UPON CALLING THIS WITH THE TAXON IDS

PNTHDL = 'ftp://ftp.pantherdb.org/ortholog/current_release'
fetch(is_dl_forced=False)
Returns:None
files = {'hcop': {'url': 'ftp://ftp.pantherdb.org/ortholog/current_release/Orthologs_HCOP.tar.gz', 'file': 'Orthologs_HCOP.tar.gz'}, 'refgenome': {'url': 'ftp://ftp.pantherdb.org/ortholog/current_release/RefGenomeOrthologs.tar.gz', 'file': 'RefGenomeOrthologs.tar.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)
Returns:None
dipper.sources.PostgreSQLSource module
class dipper.sources.PostgreSQLSource.PostgreSQLSource(graph_type, are_bnodes_skolemized, name=None)

Bases: dipper.sources.Source.Source

Class for interfacing with remote Postgres databases

fetch_from_pgdb(tables, cxn, limit=None, force=False)

Will fetch all Postgres tables from the specified database in the cxn connection parameters. This will save them to a local file named the same as the table, in tab-delimited format, including a header.

Parameters:
  • tables – Names of tables to fetch
  • cxn – database connection details
  • limit – A max row count to fetch for each table
Returns:

None
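
A hedged usage sketch; the keys expected in the cxn dict and the table names are assumptions here, shown only to illustrate the call:

from dipper.sources.PostgreSQLSource import PostgreSQLSource

src = PostgreSQLSource('rdf_graph', are_bnodes_skolemized=True, name='mydb')

# assumed connection-parameter keys; adjust to your database
cxn = {'host': 'localhost', 'port': 5432, 'database': 'mydb',
       'user': 'reader', 'password': 'secret'}

# saves each table to a local tab-delimited file named after the table
src.fetch_from_pgdb(['gene_table', 'note_table'], cxn, limit=1000)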

fetch_query_from_pgdb(qname, query, con, cxn, limit=None, force=False)

Supply either an already established connection, or connection parameters. The supplied connection will override any separate cxn parameter :param qname: The name of the query to save the output to :param query: The SQL query itself :param con: The already-established connection :param cxn: The postgres connection information :param limit: If you only want a subset of rows from the query :return:

dipper.sources.RGD module
class dipper.sources.RGD.RGD(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Ingest of Rat Genome Database gene to mammalian phenotype gaf file

RGD_BASE = 'ftp://ftp.rgd.mcw.edu/pub/data_release/annotated_rgd_objects_by_ontology/'
fetch(is_dl_forced=False)

Override Source.fetch() Fetches resources from rat_genome_database using the rat_genome_database ftp site Args:

param is_dl_forced (bool):
 Force download
Returns:
:return None
files = {'rat_gene2mammalian_phenotype': {'url': 'ftp://ftp.rgd.mcw.edu/pub/data_release/annotated_rgd_objects_by_ontology/rattus_genes_mp', 'file': 'rattus_genes_mp'}}
make_association(record)

construct the association :param record: :return: modeled association of genotype to mammalian phenotype

parse(limit=None)

Override Source.parse() Args:

:param limit (int, optional) limit the number of rows processed
Returns:
:return None
dipper.sources.Reactome module
class dipper.sources.Reactome.Reactome(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Reactome is a free, open-source, curated and peer reviewed pathway database. (http://reactome.org/)

REACTOME_BASE = 'http://www.reactome.org/download/current/'
fetch(is_dl_forced=False)

Override Source.fetch() Fetches resources from reactome using the Reactome.files dictionary Args:

param is_dl_forced (bool):
 Force download
Returns:
:return None
files = {'chebi2pathway': {'url': 'http://www.reactome.org/download/current/ChEBI2Reactome.txt', 'file': 'ChEBI2Reactome.txt'}, 'ensembl2pathway': {'url': 'http://www.reactome.org/download/current/Ensembl2Reactome.txt', 'file': 'Ensembl2Reactome.txt'}}
map_files = {'eco_map': 'http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt'}
parse(limit=None)

Override Source.parse() Args:

:param limit (int, optional) limit the number of rows processed
Returns:
:return None
dipper.sources.SGD module
class dipper.sources.SGD.SGD(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

Ingest of Saccharomyces Genome Database (SGD) phenotype associations

SGD_BASE = 'https://downloads.yeastgenome.org/curation/literature/'
fetch(is_dl_forced=False)

Override Source.fetch() Fetches resources from the Saccharomyces Genome Database (SGD) downloads site Args:

param is_dl_forced (bool):
 Force download
Returns:
:return None
files = {'sgd_phenotype': {'url': 'https://downloads.yeastgenome.org/curation/literature/phenotype_data.tab', 'file': 'phenotype_data.tab'}}
static make_apo_map()
make_association(record)

construct the association :param record: :return: modeled association of genotype to phenotype

parse(limit=None)

Override Source.parse() Args:

:param limit (int, optional) limit the number of rows processed
Returns:
:return None
dipper.sources.Source module
class dipper.sources.Source.Source(graph_type, are_bnodes_skized=False, name=None)

Bases: object

Abstract class for any data sources that we’ll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph will then be written out to a single self.name().<dest_fmt> file.

checkIfRemoteIsNewer(remote, local, headers)

Given a remote file location and the corresponding local file, this will check the datetime stamp on the files to see if the remote one is newer. This is a convenience method to be used so that we don’t have to re-fetch files that we already have saved locally. :param remote: URL of file to fetch from remote server :param local: pathname to save file to locally :return: True if the remote file is newer and should be downloaded

compare_local_remote_bytes(remotefile, localfile, remote_headers=None)

test to see if fetched file is the same size as the remote file using information in the content-length field in the HTTP header :return: True or False

declareAsOntology(graph)

The file we output needs to be declared as an ontology, including its version information.

TEC: I am not convinced dipper reformatting external data as RDF triples makes an OWL ontology (nor that it should be considered a goal).

Proper ontologies are built by ontologists. Dipper reformats data and annotates/decorates it with a minimal set of carefully arranged terms drawn from multiple proper ontologies, which allows the whole (dipper’s RDF triples and parent ontologies) to function as a single ontology we can reason over when combined in a store such as SciGraph.

Including more than the minimal ontological terms in dipper’s RDF output constitutes a liability, as it allows greater divergence between dipper artifacts and the proper ontologies.

Further information will be augmented in the dataset object. :param version: :return:

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

fetch_from_url(remotefile, localfile=None, is_dl_forced=False, headers=None)

Given a remote url and a local filename, attempt to determine if the remote file is newer; if it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded :param remotefile: URL of remote file to fetch :param localfile: pathname of file to save locally :return: None

file_len(fname)
files = {}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

static get_eco_map(url)

To convert the three-column file to a hashmap, we join primary and secondary keys. For example:

IEA GO_REF:0000002 ECO:0000256
IEA GO_REF:0000003 ECO:0000501
IEA Default ECO:0000501

becomes

IEA-GO_REF:0000002: ECO:0000256
IEA-GO_REF:0000003: ECO:0000501
IEA: ECO:0000501

Returns:dict
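
A minimal sketch of just the key-joining step (the real method also fetches and parses the file at the given url):

rows = [
    ('IEA', 'GO_REF:0000002', 'ECO:0000256'),
    ('IEA', 'GO_REF:0000003', 'ECO:0000501'),
    ('IEA', 'Default',        'ECO:0000501'),
]

eco_map = {}
for code, reference, eco_id in rows:
    # a 'Default' qualifier collapses to the bare primary key
    key = code if reference == 'Default' else code + '-' + reference
    eco_map[key] = eco_id

# {'IEA-GO_REF:0000002': 'ECO:0000256',
#  'IEA-GO_REF:0000003': 'ECO:0000501',
#  'IEA': 'ECO:0000501'}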
get_file_md5(directory, file, blocksize=1048576)
get_files(is_dl_forced, files=None)

Given a set of files for this source, it will go fetch them, and set a default version by date. If you need to set the version number by another method, then it can be set again. :param is_dl_forced - boolean :param files dict - override instance files dict :return: None

get_local_file_size(localfile)
Parameters:localfile
Returns:size of file
get_remote_content_len(remote, headers=None)
Parameters:remote
Returns:size of remote file
static hash_id(long_string)

Prepend ‘b’ to avoid leading with a digit; truncate to a 64-bit-sized word; return the sha1 hash of the string. :param long_string: str string to be hashed :return: str hash of id

static make_id(long_string, prefix='MONARCH')

A method to create DETERMINISTIC identifiers based on a string’s digest, currently implemented with sha1. :param long_string: :return:
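
A hedged sketch of the digest scheme these two methods describe; the exact truncation length is an assumption:

import hashlib

def make_id(long_string, prefix='MONARCH'):
    # sha1 the string, keep a 64-bit-sized chunk of the hex digest, and
    # lead with 'b' so the local part never starts with a digit
    digest = hashlib.sha1(long_string.encode('utf-8')).hexdigest()
    return prefix + ':b' + digest[:16]

# the same input always yields the same identifier
make_id('ClinVarVariant:254143RO:0002200HP:0000504')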

namespaces = {}
static open_and_parse_yaml(file)
Parameters:file – String, path to file containing label-id mappings in the first two columns of each row
Returns:dict where keys are labels and values are ids
parse(limit)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

static parse_mapping_file(file)
Parameters:file – String, path to file containing label-id mappings in the first two columns of each row
Returns:dict where keys are labels and values are ids
process_xml_table(elem, table_name, processing_function, limit)

This is a convenience function to process the elements of an xml document, when the xml is used as an alternative way of distributing sql-like tables. In this case, the “elem” is akin to an sql table, with its name of `table_name`. It will then process each `row` given the `processing_function` supplied.

Parameters:
  • elem – The element data
  • table_name – The name of the table to process
  • processing_function – The row processing function
  • limit

Appears to be making calls to the elementTree library, although it is not explicitly imported here.

Returns:
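
An illustrative sketch of the “element as table, children as rows” pattern this method walks; the element and field names here are made up, not a real schema:

import xml.etree.ElementTree as ET

def print_breed(row):
    # a trivial row-processing function
    print({child.tag: child.text for child in row})

elem = ET.fromstring(
    '<Breeds>'
    '<row><breed_id>1</breed_id><breed_name>Akita</breed_name></row>'
    '<row><breed_id>2</breed_id><breed_name>Beagle</breed_name></row>'
    '</Breeds>')

# process_xml_table(elem, 'Breeds', print_breed, limit) performs essentially
# this loop, stopping after `limit` rows when one is supplied
for row in elem:
    print_breed(row)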
static remove_backslash_r(filename, encoding)

A helpful utility to remove Carriage Return from any file. This will read a file into memory, and overwrite the contents of the original file.

TODO: This function may be a liability

Parameters:filename
Returns:
settestmode(mode)

Set testMode to (mode). - True: run the Source in testMode; - False: run it in full mode :param mode: :return: None

settestonly(testonly)

Set that this source should only be processed in testMode :param testOnly: :return: None

whoami()
write(fmt='turtle', stream=None)

This convenience method will write out all of the graphs associated with the source. Right now these are hardcoded to be a single “graph” and a “src_dataset.ttl” and a “src_test.ttl”. If you do not supply stream=’stdout’ it will by default write these to files.

In addition, if the version number isn’t yet set in the dataset, it will be set to the date on file. :return: None

dipper.sources.StringDB module
class dipper.sources.StringDB.StringDB(graph_type, are_bnodes_skolemized, tax_ids=None, version=None)

Bases: dipper.sources.Source.Source

STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases. From: http://string-db.org/cgi/about.pl?footer_active_subpage=content

STRING uses one protein per gene. If there is more than one isoform per gene, we usually select the longest isoform, unless we have information that suggests another isoform is regarded as canonical (e.g., proteins in the CCDS database). From: http://string-db.org/cgi/help.pl

DEFAULT_TAXA = [9606, 10090, 7955, 7227, 6239]
STRING_BASE = 'http://string-db.org/download/'
fetch(is_dl_forced=False)

Override Source.fetch() Fetches resources from String

We also fetch ensembl to determine if protein pairs are from the same species Args:

param is_dl_forced (bool):
 Force download
Returns:
:return None
parse(limit=None)

Override Source.parse() Args:

:param limit (int, optional) limit the number of rows processed
Returns:
:return None
dipper.sources.UCSCBands module
class dipper.sources.UCSCBands.UCSCBands(graph_type, are_bnodes_skolemized, tax_ids=None)

Bases: dipper.sources.Source.Source

This will take the UCSC definitions of cytogenetic bands and create the nested structures to enable overlap and containment queries. We use `Monochrom.py` to create the OWL-classes of the chromosomal parts. Here, we simply worry about the instance-level values for particular genome builds.

Given a chr band definition, the nested containment structures look like: 13q21.31 ==> 13q21.31, 13q21.3, 13q21, 13q2, 13q, 13

We determine the containing regions of the band by parsing the band-string; since each alphanumeric is a significant “place”, we can split it with the shorter strings being parents of the longer string. Here we create build-specific chroms, which are instances of the classes produced from `Monochrom.py`. You can instantiate any number of builds for a genome.

We leverage the Faldo model here for region definitions, and map each of the chromosomal parts to SO.

We differentiate the build by adding the build id to the identifier prior to the chromosome number. These then are instances of the species-specific chromosomal class.

The build-specific chromosomes are created like: <pre>
<build number>chr<num><band>
</pre> with triples for a given band like: <pre>
_:hg19chr1p36.33
    rdfs:type SO:chromosome_band, faldo:Region, CHR:9606chr1p36.33,
    subsequence_of _:hg19chr1p36.3,
    faldo:location [ a faldo:BothStrandPosition
        faldo:begin 0,
        faldo:end 2300000,
        faldo:reference ‘hg19’
    ] .
</pre> where any band in the file is an instance of a chr_band (or a more specific type), is a subsequence of its containing region, and is located in the specified coordinates.

We do not have a separate graph for testing.

TODO: any species by commandline argument

HGGP = 'http://hgdownload.cse.ucsc.edu/goldenPath'
fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'10090': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/cytoBandIdeo.txt.gz', 'build_num': 'mm10', 'genome_label': 'Mouse', 'file': 'mm10cytoBand.txt.gz'}, '7955': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/danRer10/database/cytoBandIdeo.txt.gz', 'build_num': 'danRer10', 'genome_label': 'Zebrafish', 'file': 'danRer10cytoBand.txt.gz'}, '9031': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/galGal4/database/cytoBandIdeo.txt.gz', 'build_num': 'galGal4', 'genome_label': 'chicken', 'file': 'galGal4cytoBand.txt.gz'}, '9606': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz', 'build_num': 'hg19', 'genome_label': 'Human', 'file': 'hg19cytoBand.txt.gz'}, '9796': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/equCab2/database/cytoBandIdeo.txt.gz', 'build_num': 'equCab2', 'genome_label': 'horse', 'file': 'equCab2cytoBand.txt.gz'}, '9823': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/cytoBandIdeo.txt.gz', 'build_num': 'susScr3', 'genome_label': 'pig', 'file': 'susScr3cytoBand.txt.gz'}, '9913': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/cytoBandIdeo.txt.gz', 'build_num': 'bosTau7', 'genome_label': 'cow', 'file': 'bosTau7cytoBand.txt.gz'}, '9940': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/oviAri3/database/cytoBandIdeo.txt.gz', 'build_num': 'oviAri3', 'genome_label': 'sheep', 'file': 'oviAri3cytoBand.txt.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

dipper.sources.UDP module
class dipper.sources.UDP.UDP(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

The National Institutes of Health (NIH) Undiagnosed Diseases Program (UDP) is part of the Undiagnosed Disease Network (UDN), an NIH Common Fund initiative that focuses on the most puzzling medical cases referred to the NIH Clinical Center in Bethesda, Maryland. from https://www.genome.gov/27544402/the-undiagnosed-diseases-program/

Data is available by request for access via the NHGRI collaboration server: https://udplims-collab.nhgri.nih.gov/api

Note the fetcher requires credentials for the UDP collaboration server. Credentials are added via a config file, config.json, in the following format:

{
    "dbauth": {
        "udp": {
            "user": "foo",
            "password": "bar"
        }
    }
}

See dipper/config.py for more information

Output of fetcher:

udp_variants.tsv: ‘Patient’, ‘Family’, ‘Chr’, ‘Build’, ‘Chromosome Position’, ‘Reference Allele’, ‘Variant Allele’, ‘Parent of origin’, ‘Allele Type’, ‘Mutation Type’, ‘Gene’, ‘Transcript’, ‘Original Amino Acid’, ‘Variant Amino Acid’, ‘Amino Acid Change’, ‘Segregates with’, ‘Position’, ‘Exon’, ‘Inheritance model’, ‘Zygosity’, ‘dbSNP ID’, ‘1K Frequency’, ‘Number of Alleles’

udp_phenotypes.tsv: ‘Patient’, ‘HPID’, ‘Present’

The script also utilizes two mapping files:
  • udp_gene_map.tsv – generated from scripts/fetch-gene-ids.py, gene symbols from udp_variants
  • udp_chr_rs.tsv – rsid(s) per coordinate grepped from the hg19 dbsnp file, then disambiguated with eutils, see scripts/dbsnp/dbsnp.py
UDP_SERVER = 'https://udplims-collab.nhgri.nih.gov/api'
fetch(is_dl_forced=True)

Fetches data from udp collaboration server, see top level comments for class for more information :return:

files = {'patient_phenotypes': {'file': 'udp_phenotypes.tsv'}, 'patient_variants': {'file': 'udp_variants.tsv'}}
map_files = {'dbsnp_map': '../../resources/udp/udp_chr_rs.tsv', 'gene_coord_map': '../../resources/udp/gene_coordinates.tsv', 'patient_ids': '../../resources/udp/patient_ids.yaml'}
parse(limit=None)

Override Source.parse() Args:

:param limit (int, optional) limit the number of rows processed
Returns:
:return None
dipper.sources.WormBase module
class dipper.sources.WormBase.WormBase(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

This is the parser for the [C. elegans Model Organism Database (WormBase)](http://www.wormbase.org), from which we process genotype and phenotype data for laboratory worms (C. elegans and other nematodes).

We generate the wormbase graph to include the following information:
  • genes
  • sequence alterations (includes SNPs/del/ins/indel and large chromosomal rearrangements)
  • RNAi as expression-affecting reagents
  • genotypes, and their components
  • strains
  • publications (and their mapping to PMIDs, if available)
  • allele-to-phenotype associations (including variants by RNAi)
  • genetic positional information for genes and sequence alterations

Genotypes leverage the GENO genotype model and include both intrinsic and extrinsic genotypes. Where necessary, we create anonymous nodes of the genotype partonomy (i.e. for variant single locus complements, genomic variation complements, variant loci, extrinsic genotypes, and extrinsic genotype parts).

TODO: get people and gene expression

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'allele_pheno': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY/phenotype_association.WSNUMBER.wb', 'file': 'phenotype_association.wb'}, 'checksums': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/CHECKSUMS', 'file': 'CHECKSUMS'}, 'disease_assoc': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY/disease_association.WSNUMBER.wb', 'file': 'disease_association.wb'}, 'feature_loc': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WSNUMBER.annotations.gff3.gz', 'file': 'c_elegans.PRJNA13758.annotations.gff3.gz'}, 'gene_ids': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WSNUMBER.geneIDs.txt.gz', 'file': 'c_elegans.PRJNA13758.geneIDs.txt.gz'}, 'pub_xrefs': {'url': 'http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/generic.cgi?action=WpaXref', 'file': 'pub_xrefs.txt'}, 'rnai_pheno': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY/rnai_phenotypes.WSNUMBER.wb', 'file': 'rnai_phenotypes.wb'}, 'xrefs': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WSNUMBER.xrefs.txt.gz', 'file': 'c_elegans.PRJNA13758.xrefs.txt.gz'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

get_feature_type_by_class_and_biotype(ftype, biotype)
make_reagent_targeted_gene_id(gene_id, reagent_id)
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_allele_phenotype(limit=None)

This file compactly lists variant to phenotype associations, such that in a single row, there may be >1 variant listed per phenotype and paper. This indicates that each variant is individually associated with the given phenotype, as listed in 1+ papers. (Not that the combination of variants is producing the phenotype.) :param limit: :return:

process_disease_association(limit)
process_feature_loc(limit)
process_gene_desc(limit)
process_gene_ids(limit)
process_gene_interaction(limit)

The gene interaction file includes identified interactions that are between two or more gene (products). In the case of interactions with >2 genes, this requires creating groups of genes that are involved in the interaction. From the wormbase help list: in the example WBInteraction000007779 it would likely be misleading to suggest that lin-12 interacts with (suppresses, in this case) smo-1 ALONE or that lin-12 suppresses let-60 ALONE; the observation in the paper (see Table V in PMID:15990876) was that a lin-12 allele (heterozygous lin-12(n941/+)) could suppress the “multivulva” phenotype induced synthetically by simultaneous perturbation of BOTH smo-1 (by RNAi) AND let-60 (by the n2021 allele). So this is necessarily a three-gene interaction.

Therefore, we can create groups of genes based on their “status” of Effector | Effected.

Status: IN PROGRESS

Parameters:limit
Returns:
process_pub_xrefs(limit=None)
process_rnai_phenotypes(limit=None)
species = '/species/c_elegans/PRJNA13758'
test_ids = {'allele': ['WBVar00087800', 'WBVar00087742', 'WBVar00144481', 'WBVar00248869', 'WBVar00250630'], 'gene': ['WBGene00001414', 'WBGene00004967', 'WBGene00003916', 'WBGene00004397', 'WBGene00001531'], 'pub': [], 'strain': ['BA794', 'RK1', 'HE1006']}
update_wsnum_in_files(vernum)

With the given version number `vernum`, update the source’s version number, and replace it in the file hashmap. The version number is in the CHECKSUMS file. :param vernum: :return:

wbdev = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release'
wbprod = 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release'
wbrel = 'ftp://ftp.wormbase.org/pub/wormbase/releases'
dipper.sources.ZFIN module
class dipper.sources.ZFIN.ZFIN(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

This is the parser for the [Zebrafish Model Organism Database (ZFIN)](http://www.zfin.org), from which we process genotype and phenotype data for laboratory zebrafish.

We generate the zfin graph to include the following information:
  • genes
  • sequence alterations (includes SNPs/del/ins/indel and large chromosomal rearrangements)
  • transgenic constructs
  • morpholinos, talens, crisprs as expression-affecting reagents
  • genotypes, and their components
  • fish (as comprised of intrinsic and extrinsic genotypes)
  • publications (and their mapping to PMIDs, if available)
  • genotype-to-phenotype associations (including environments and stages at which they are assayed)
  • environmental components
  • orthology to human genes
  • genetic positional information for genes and sequence alterations
  • fish-to-disease model associations

Genotypes leverage the GENO genotype model and include both intrinsic and extrinsic genotypes. Where necessary, we create anonymous nodes of the genotype partonomy (such as for variant single locus complements, genomic variation complements, variant loci, extrinsic genotypes, and extrinsic genotype parts).

Furthermore, we process the genotype components to build labels in a monarch-style. This leads to genotype labels that include:
  • all genes targeted by reagents (morphants, crisprs, etc), in addition to the ones that the reagent was designed against
  • all affected genes within deficiencies
  • complex hets being listed as gene<mutation1>/gene<mutation2> rather than gene<mutation1>/+; gene<mutation2>/+

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'backgrounds': {'url': 'http://zfin.org/downloads/genotype_backgrounds.txt', 'file': 'genotype_backgrounds.txt'}, 'crispr': {'url': 'http://zfin.org/downloads/CRISPR.txt', 'file': 'CRISPR.txt'}, 'enviro': {'url': 'http://zfin.org/downloads/pheno_environment_fish.txt', 'file': 'pheno_environment_fish.txt'}, 'feature_affected_gene': {'url': 'http://zfin.org/downloads/features-affected-genes.txt', 'file': 'features-affected-genes.txt'}, 'features': {'url': 'http://zfin.org/downloads/features.txt', 'file': 'features.txt'}, 'fish_components': {'url': 'http://zfin.org/downloads/fish_components_fish.txt', 'file': 'fish_components_fish.txt'}, 'fish_disease_models': {'url': 'http://zfin.org/downloads/fish_model_disease.txt', 'file': 'fish_model_disease.txt'}, 'genbank': {'url': 'http://zfin.org/downloads/genbank.txt', 'file': 'genbank.txt'}, 'gene': {'url': 'http://zfin.org/downloads/gene.txt', 'file': 'gene.txt'}, 'gene_coordinates': {'url': 'http://zfin.org/downloads/E_zfin_gene_alias.gff3', 'file': 'E_zfin_gene_alias.gff3'}, 'gene_marker_rel': {'url': 'http://zfin.org/downloads/gene_marker_relationship.txt', 'file': 'gene_marker_relationship.txt'}, 'geno': {'url': 'http://zfin.org/downloads/genotype_features.txt', 'file': 'genotype_features.txt'}, 'human_orthos': {'url': 'http://zfin.org/downloads/human_orthos.txt', 'file': 'human_orthos.txt'}, 'mappings': {'url': 'http://zfin.org/downloads/mappings.txt', 'file': 'mappings.txt'}, 'morph': {'url': 'http://zfin.org/downloads/Morpholinos.txt', 'file': 'Morpholinos.txt'}, 'pheno': {'url': 'http://zfin.org/downloads/phenotype_fish.txt', 'file': 'phenotype_fish.txt'}, 'pub2pubmed': {'url': 'http://zfin.org/downloads/pub_to_pubmed_id_translation.txt', 'file': 'pub_to_pubmed_id_translation.txt'}, 'pubs': {'url': 'http://zfin.org/downloads/zfinpubs.txt', 'file': 'zfinpubs.txt'}, 'stage': {'url': 'http://zfin.org/Downloads/stage_ontology.txt', 'file': 'stage_ontology.txt'}, 'talen': {'url': 'http://zfin.org/downloads/TALEN.txt', 'file': 'TALEN.txt'}, 'uniprot': {'url': 'http://zfin.org/downloads/uniprot.txt', 'file': 'uniprot.txt'}, 'wild': {'url': 'http://zfin.org/downloads/wildtypes_fish.txt', 'file': 'wildtypes.txt'}, 'zpmap': {'url': 'http://compbio.charite.de/hudson/job/zp-owl-new/lastSuccessfulBuild/artifact/zp.annot_sourceinfo', 'file': 'zp-mapping.txt'}}
getTestSuite()

An abstract method that should be overwritten with tests appropriate for the specific source. :return:

get_orthology_evidence_code(abbrev)
get_orthology_sources_from_zebrafishmine()

Fetch the zfin gene to other species orthology annotations, together with the evidence for the assertion. Write the file locally to be read in a separate function. :return:

static make_targeted_gene_id(geneid, reagentid)
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

process_fish(limit=None)

Fish give identifiers to the “effective genotypes” that we create. We can match these by: Fish = (intrinsic) genotype + set of morpholinos

We assume here that the intrinsic genotypes and their parts will be processed separately, prior to calling this function.

Parameters:limit
Returns:
process_fish_disease_models(limit=None)
process_orthology_evidence(limit)
scrub()

Perform various data-scrubbing on the raw data files prior to parsing. For this resource, this currently includes: * remove oddities where there are “” instead of empty strings :return: None

test_ids = {'allele': ['ZDB-ALT-010426-4', 'ZDB-ALT-010427-8', 'ZDB-ALT-011017-8', 'ZDB-ALT-051005-2', 'ZDB-ALT-051227-8', 'ZDB-ALT-060221-2', 'ZDB-ALT-070314-1', 'ZDB-ALT-070409-1', 'ZDB-ALT-070420-6', 'ZDB-ALT-080528-1', 'ZDB-ALT-080528-6', 'ZDB-ALT-080827-15', 'ZDB-ALT-080908-7', 'ZDB-ALT-090316-1', 'ZDB-ALT-100519-1', 'ZDB-ALT-111024-1', 'ZDB-ALT-980203-1374', 'ZDB-ALT-980203-412', 'ZDB-ALT-980203-465', 'ZDB-ALT-980203-470', 'ZDB-ALT-980203-605', 'ZDB-ALT-980413-636', 'ZDB-ALT-021021-2', 'ZDB-ALT-080728-1', 'ZDB-ALT-100729-1', 'ZDB-ALT-980203-1560', 'ZDB-ALT-001127-6', 'ZDB-ALT-001129-2', 'ZDB-ALT-980203-1091', 'ZDB-ALT-070118-2', 'ZDB-ALT-991005-33', 'ZDB-ALT-020918-2', 'ZDB-ALT-040913-6', 'ZDB-ALT-980203-1827', 'ZDB-ALT-090504-6', 'ZDB-ALT-121218-1'], 'environment': ['ZDB-EXP-050202-1', 'ZDB-EXP-071005-3', 'ZDB-EXP-071227-14', 'ZDB-EXP-080428-1', 'ZDB-EXP-080428-2', 'ZDB-EXP-080501-1', 'ZDB-EXP-080805-7', 'ZDB-EXP-080806-5', 'ZDB-EXP-080806-8', 'ZDB-EXP-080806-9', 'ZDB-EXP-081110-3', 'ZDB-EXP-090505-2', 'ZDB-EXP-100330-7', 'ZDB-EXP-100402-1', 'ZDB-EXP-100402-2', 'ZDB-EXP-100422-3', 'ZDB-EXP-100511-5', 'ZDB-EXP-101025-12', 'ZDB-EXP-101025-13', 'ZDB-EXP-110926-4', 'ZDB-EXP-110927-1', 'ZDB-EXP-120809-5', 'ZDB-EXP-120809-7', 'ZDB-EXP-120809-9', 'ZDB-EXP-120913-5', 'ZDB-EXP-130222-13', 'ZDB-EXP-130222-7', 'ZDB-EXP-130904-2', 'ZDB-EXP-041102-1', 'ZDB-EXP-140822-13', 'ZDB-EXP-041102-1', 'ZDB-EXP-070129-3', 'ZDB-EXP-110929-7', 'ZDB-EXP-100520-2', 'ZDB-EXP-100920-3', 'ZDB-EXP-100920-5', 'ZDB-EXP-090601-2', 'ZDB-EXP-151116-3'], 'fish': ['ZDB-FISH-150901-17912', 'ZDB-FISH-150901-18649', 'ZDB-FISH-150901-26314', 'ZDB-FISH-150901-9418', 'ZDB-FISH-150901-14591', 'ZDB-FISH-150901-9997', 'ZDB-FISH-150901-23877', 'ZDB-FISH-150901-22128', 'ZDB-FISH-150901-14869', 'ZDB-FISH-150901-6695', 'ZDB-FISH-150901-24158', 'ZDB-FISH-150901-3631', 'ZDB-FISH-150901-20836', 'ZDB-FISH-150901-1060', 'ZDB-FISH-150901-8451', 'ZDB-FISH-150901-2423', 'ZDB-FISH-150901-20257', 'ZDB-FISH-150901-10002', 'ZDB-FISH-150901-12520', 'ZDB-FISH-150901-14833', 'ZDB-FISH-150901-2104', 'ZDB-FISH-150901-6607', 'ZDB-FISH-150901-1409'], 'gene': ['ZDB-GENE-000616-6', 'ZDB-GENE-000710-4', 'ZDB-GENE-030131-2773', 'ZDB-GENE-030131-8769', 'ZDB-GENE-030219-146', 'ZDB-GENE-030404-2', 'ZDB-GENE-030826-1', 'ZDB-GENE-030826-2', 'ZDB-GENE-040123-1', 'ZDB-GENE-040426-1309', 'ZDB-GENE-050522-534', 'ZDB-GENE-060503-719', 'ZDB-GENE-080405-1', 'ZDB-GENE-081211-2', 'ZDB-GENE-091118-129', 'ZDB-GENE-980526-135', 'ZDB-GENE-980526-166', 'ZDB-GENE-980526-196', 'ZDB-GENE-980526-265', 'ZDB-GENE-980526-299', 'ZDB-GENE-980526-41', 'ZDB-GENE-980526-437', 'ZDB-GENE-980526-44', 'ZDB-GENE-980526-481', 'ZDB-GENE-980526-561', 'ZDB-GENE-980526-89', 'ZDB-GENE-990415-181', 'ZDB-GENE-990415-72', 'ZDB-GENE-990415-75', 'ZDB-GENE-980526-44', 'ZDB-GENE-030421-3', 'ZDB-GENE-980526-196', 'ZDB-GENE-050320-62', 'ZDB-GENE-061013-403', 'ZDB-GENE-041114-104', 'ZDB-GENE-030131-9700', 'ZDB-GENE-031114-1', 'ZDB-GENE-990415-72', 'ZDB-GENE-030131-2211', 'ZDB-GENE-030131-3063', 'ZDB-GENE-030131-9460', 'ZDB-GENE-980526-26', 'ZDB-GENE-980526-27', 'ZDB-GENE-980526-29', 'ZDB-GENE-071218-6', 'ZDB-GENE-070912-423', 'ZDB-GENE-011207-1', 'ZDB-GENE-980526-284', 'ZDB-GENE-980526-72', 'ZDB-GENE-991129-7', 'ZDB-GENE-000607-83', 'ZDB-GENE-090504-2'], 'genotype': ['ZDB-GENO-010426-2', 'ZDB-GENO-010427-3', 'ZDB-GENO-010427-4', 'ZDB-GENO-050209-30', 'ZDB-GENO-051018-1', 'ZDB-GENO-070209-80', 'ZDB-GENO-070215-11', 'ZDB-GENO-070215-12', 'ZDB-GENO-070228-3', 'ZDB-GENO-070406-1', 'ZDB-GENO-070712-5', 
'ZDB-GENO-070917-2', 'ZDB-GENO-080328-1', 'ZDB-GENO-080418-2', 'ZDB-GENO-080516-8', 'ZDB-GENO-080606-609', 'ZDB-GENO-080701-2', 'ZDB-GENO-080713-1', 'ZDB-GENO-080729-2', 'ZDB-GENO-080804-4', 'ZDB-GENO-080825-3', 'ZDB-GENO-091027-1', 'ZDB-GENO-091027-2', 'ZDB-GENO-091109-1', 'ZDB-GENO-100325-3', 'ZDB-GENO-100325-4', 'ZDB-GENO-100325-5', 'ZDB-GENO-100325-6', 'ZDB-GENO-100524-2', 'ZDB-GENO-100601-2', 'ZDB-GENO-100910-1', 'ZDB-GENO-111025-3', 'ZDB-GENO-120522-18', 'ZDB-GENO-121210-1', 'ZDB-GENO-130402-5', 'ZDB-GENO-980410-268', 'ZDB-GENO-080307-1', 'ZDB-GENO-960809-7', 'ZDB-GENO-990623-3', 'ZDB-GENO-130603-1', 'ZDB-GENO-001127-3', 'ZDB-GENO-001129-1', 'ZDB-GENO-090203-8', 'ZDB-GENO-070209-1', 'ZDB-GENO-070118-1', 'ZDB-GENO-140529-1', 'ZDB-GENO-070820-1', 'ZDB-GENO-071127-3', 'ZDB-GENO-000209-20', 'ZDB-GENO-980202-1565', 'ZDB-GENO-010924-10', 'ZDB-GENO-010531-2', 'ZDB-GENO-090504-5', 'ZDB-GENO-070215-11', 'ZDB-GENO-121221-1'], 'morpholino': ['ZDB-MRPHLNO-041129-1', 'ZDB-MRPHLNO-041129-2', 'ZDB-MRPHLNO-041129-3', 'ZDB-MRPHLNO-050308-1', 'ZDB-MRPHLNO-050308-3', 'ZDB-MRPHLNO-060508-2', 'ZDB-MRPHLNO-070118-1', 'ZDB-MRPHLNO-070522-3', 'ZDB-MRPHLNO-070706-1', 'ZDB-MRPHLNO-070725-1', 'ZDB-MRPHLNO-070725-2', 'ZDB-MRPHLNO-071005-1', 'ZDB-MRPHLNO-071227-1', 'ZDB-MRPHLNO-080307-1', 'ZDB-MRPHLNO-080428-1', 'ZDB-MRPHLNO-080430-1', 'ZDB-MRPHLNO-080919-4', 'ZDB-MRPHLNO-081110-3', 'ZDB-MRPHLNO-090106-5', 'ZDB-MRPHLNO-090114-1', 'ZDB-MRPHLNO-090505-1', 'ZDB-MRPHLNO-090630-11', 'ZDB-MRPHLNO-090804-1', 'ZDB-MRPHLNO-100728-1', 'ZDB-MRPHLNO-100823-6', 'ZDB-MRPHLNO-101105-3', 'ZDB-MRPHLNO-110323-3', 'ZDB-MRPHLNO-111104-5', 'ZDB-MRPHLNO-130222-4', 'ZDB-MRPHLNO-080430', 'ZDB-MRPHLNO-100823-6', 'ZDB-MRPHLNO-140822-1', 'ZDB-MRPHLNO-100520-4', 'ZDB-MRPHLNO-100520-5', 'ZDB-MRPHLNO-100920-3', 'ZDB-MRPHLNO-050604-1', 'ZDB-CRISPR-131113-1', 'ZDB-MRPHLNO-140430-12', 'ZDB-MRPHLNO-140430-13'], 'pub': ['PMID:11566854', 'PMID:12588855', 'PMID:12867027', 'PMID:14667409', 'PMID:15456722', 'PMID:16914492', 'PMID:17374715', 'PMID:17545503', 'PMID:17618647', 'PMID:17785424', 'PMID:18201692', 'PMID:18358464', 'PMID:18388326', 'PMID:18638469', 'PMID:18846223', 'PMID:19151781', 'PMID:19759004', 'PMID:19855021', 'PMID:20040115', 'PMID:20138861', 'PMID:20306498', 'PMID:20442775', 'PMID:20603019', 'PMID:21147088', 'PMID:21893049', 'PMID:21925157', 'PMID:22718903', 'PMID:22814753', 'PMID:22960038', 'PMID:22996643', 'PMID:23086717', 'PMID:23203810', 'PMID:23760954', 'ZFIN:ZDB-PUB-140303-33', 'ZFIN:ZDB-PUB-140404-9', 'ZFIN:ZDB-PUB-080902-16', 'ZFIN:ZDB-PUB-101222-7', 'ZFIN:ZDB-PUB-140614-2', 'ZFIN:ZDB-PUB-120927-26', 'ZFIN:ZDB-PUB-100504-5', 'ZFIN:ZDB-PUB-140513-341']}
dipper.sources.ZFINSlim module
class dipper.sources.ZFINSlim.ZFINSlim(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.Source.Source

zfin mgi model only containing Gene to phenotype associations Using the file here: https://zfin.org/downloads/phenoGeneCleanData_fish.txt

fetch(is_dl_forced=False)

abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None

files = {'g2p_clean': {'url': 'https://zfin.org/downloads/phenoGeneCleanData_fish.txt', 'file': 'phenoGeneCleanData_fish.txt.txt'}, 'zpmap': {'url': 'http://compbio.charite.de/hudson/job/zp-owl-new/lastSuccessfulBuild/artifact/zp.annot_sourceinfo', 'file': 'zp-mapping.txt'}}
parse(limit=None)

abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None

dipper.utils package
Submodules
dipper.utils.CurieUtil module
class dipper.utils.CurieUtil.CurieUtil(curie_map)

Bases: object

Create compact URI

get_base()
get_curie(uri)

Get a CURIE from a URI

get_curie_prefix(uri)

Return the CURIE’s prefix:

get_uri(curie)

Get a URI from a CURIE

prefix_exists(pfx)
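
A hedged usage sketch; the mapping itself comes from curie_map.yaml via the dipper.curie_map module (documented below), so the expanded IRI depends on that file:

from dipper import curie_map
from dipper.utils.CurieUtil import CurieUtil

cu = CurieUtil(curie_map.get())

uri = cu.get_uri('HP:0000504')
# e.g. 'http://purl.obolibrary.org/obo/HP_0000504', per curie_map.yaml

cu.get_curie(uri)         # 'HP:0000504'
cu.prefix_exists('HP')    # True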
dipper.utils.DipperUtil module
class dipper.utils.DipperUtil.DipperUtil

Bases: object

Various utilities and quick methods used in this application

(A little too quick.) Per https://www.ncbi.nlm.nih.gov/books/NBK25497/, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays.

Restructuring to make bulk queries is less likely to result in another ban for peppering them with one-offs.

static get_homologene_by_gene_num(gene_num)
static get_ncbi_id_from_symbol(gene_symbol)

Get ncbi gene id from symbol using monarch and mygene services :param gene_symbol: :return:
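
For example (this makes live web-service calls; the exact return format is an assumption here):

from dipper.utils.DipperUtil import DipperUtil

DipperUtil.get_ncbi_id_from_symbol('SHH')   # e.g. 'NCBIGene:6469'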

static get_ncbi_taxon_num_by_label(label)

Here we want to look up the NCBI Taxon id using some kind of label. It will only return a result if there is a unique hit.

Returns:
static is_omim_disease(gene_id)

Process omim equivalencies by examining the monarch ontology scigraph. As an alternative we could examine mondo.owl, but since the ontology scigraph imports the output of this script, that creates an odd circular dependency (even though we’re querying mondo.owl through scigraph).

Parameters:
  • graph – rdfLib graph object
  • gene_id – ncbi gene id as curie
  • omim_id – omim id as curie
Returns:

None

remove_control_characters(s)
Filters out characters in any of these Unicode categories:
  • [Cc] Other, Control (65 characters)
  • [Cf] Other, Format (151 characters)
  • [Cn] Other, Not Assigned (0 characters – none have this property)
  • [Co] Other, Private Use (6 characters)
  • [Cs] Other, Surrogate (6 characters)
dipper.utils.GraphUtils module
class dipper.utils.GraphUtils.GraphUtils(curie_map)

Bases: object

static add_property_axioms(graph, properties)
static add_property_to_graph(results, graph, property_type, property_list)
digest_id()

Form a deterministic digest of the input. The leading ‘b’ is an experiment forcing the first char to be non-numeric but valid hex; not required for RDF, but some other contexts do not want the leading char to be a digit.

:param str wordage: arbitrary string :return: str

static get_properties_from_graph(graph)

Wrapper for RDFLib.graph.predicates() that returns a unique set :param graph: RDFLib.graph :return: set, set of properties

write(graph, fileformat=None, file=None)

A basic graph writer (to stdout) for any of the sources. This will write raw triples in rdfxml, unless specified otherwise; to write turtle, specify format=’turtle’. An optional file can be supplied instead of stdout. :return: None

dipper.utils.TestUtils module
class dipper.utils.TestUtils.TestUtils

Bases: object

test_graph_equality(turtlish, graph)
Parameters:
  • turtlish – String of triples in turtle format without prefix header
  • graph – Graph object to test against
Returns:

Boolean, True if graphs contain same set of triples
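
A hedged sketch of how this can be combined with the models API from the getting-started example; the expected triples use full IRIs because no prefix header is supplied:

from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model
from dipper.utils.TestUtils import TestUtils

graph = RDFGraph()
model = Model(graph)
model.addTriple('HP:0000504', 'rdfs:subClassOf', 'HP:0000478')

expected = '''
    <http://purl.obolibrary.org/obo/HP_0000504>
        <http://www.w3.org/2000/01/rdf-schema#subClassOf>
        <http://purl.obolibrary.org/obo/HP_0000478> .
'''

assert TestUtils().test_graph_equality(expected, graph)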

dipper.utils.pysed module
dipper.utils.pysed.replace(oldstr, newstr, infile, dryrun=False)

Sed-like Replace function.. Usage: pysed.replace(<Old string>, <Replacement String>, <Text File>) Example: pysed.replace(‘xyz’, ‘XYZ’, ‘/path/to/file.txt’)

This will dump the output to STDOUT instead of changing the input file. Example ‘DRYRUN’: pysed.replace(‘xyz’, ‘XYZ’, ‘/path/to/file.txt’, dryrun=True)

dipper.utils.pysed.rmlinematch(oldstr, infile, dryrun=False)

Sed-like line deletion function based on given string.. Usage: pysed.rmlinematch(<Unwanted string>, <Text File>) Example: pysed.rmlinematch(‘xyz’, ‘/path/to/file.txt’) Example: ‘DRYRUN’: pysed.rmlinematch(‘xyz’, ‘/path/to/file.txt’, dryrun=True) This will dump the output to STDOUT instead of changing the input file.

dipper.utils.pysed.rmlinenumber(linenumber, infile, dryrun=False)

Sed-like line deletion function based on given line number.. Usage: pysed.rmlinenumber(<Unwanted Line Number>, <Text File>) Example: pysed.rmlinenumber(10, ‘/path/to/file.txt’) Example ‘DRYRUN’: pysed.rmlinenumber(10, ‘/path/to/file.txt’, dryrun=True) #This will dump the output to STDOUT instead of changing the input file.

dipper.utils.romanplus module
exception dipper.utils.romanplus.InvalidRomanNumeralError

Bases: dipper.utils.romanplus.RomanError

exception dipper.utils.romanplus.NotIntegerError

Bases: dipper.utils.romanplus.RomanError

exception dipper.utils.romanplus.OutOfRangeError

Bases: dipper.utils.romanplus.RomanError

exception dipper.utils.romanplus.RomanError

Bases: Exception

dipper.utils.romanplus.fromRoman(s)

convert Roman numeral to integer

dipper.utils.romanplus.toRoman(n)

convert integer to Roman numeral
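
For example:

from dipper.utils.romanplus import toRoman, fromRoman

toRoman(2017)      # 'MMXVII'
fromRoman('XIV')   # 14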

Submodules
dipper.config module
dipper.config.conf = {'keys': {'omim': ''}}

Load the configuration file ‘conf.json’, if it exists. It isn’t always required, but may be for some sources. conf.json may contain sensitive info and should not live in a public repo.

dipper.config.get_config()
dipper.curie_map module

Acroname central

Load the curie mapping file ‘curie_map.yaml’; it is necessary for most resources.

dipper.curie_map.get()
dipper.curie_map.get_base()
