Welcome to Dipper’s documentation!¶
Dipper is a Python package to generate RDF triples from common scientific resources. Dipper includes subpackages and modules to create graphical models of this data, including:
- Models package for generating common sets of triples, including common OWL axioms, complex genotypes, associations, evidence and provenance models.
- Graph package for building graphs with RDFLib or streaming n-triples
- Source package containing fetchers and parsers that interface with remote databases and web services
Getting started¶
Installing, running, and the basics
Installation¶
Dipper requires Python version 3.5 or higher.
Install with pip:
pip install dipper
Development version¶
The development version can be pulled from GitHub.
pip3 install git+git://github.com/monarch-initiative/dipper.git
Building locally¶
git clone https://github.com/monarch-initiative/dipper.git
cd dipper
pip install .
Alternatively, a subset of source-specific requirements may be downloaded. To download the core requirements:
pip install -r requirements.txt
To download source-specific requirements, use the requirements/ directory, for example:
pip install -r requirements/mgi.txt
To download requirements for all sources:
pip install -r requirements/all-sources.txt
Getting started with Dipper¶
This guide assumes you have already installed dipper. If not, then follow the steps in the Installation section.
Command line¶
You can run the code by supplying a list of one or more sources on the command line. For example:
dipper-etl.py --sources omim,ncbigene
You can also supply a limit; this will only process the first N rows or data elements:
dipper-etl.py --sources hpoa --limit 100
Other command line parameters are explained if you request help:
dipper-etl.py --help
Notebooks¶
We provide Jupyter Notebooks to illustrate the functionality of the Python library. These can also be run interactively.
See the Notebooks section for more details.
Building models¶
This code example shows some of the basics of building RDF graphs using the models API:
import pandas as pd
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model
columns = ['variant', 'variant_label', 'variant_type',
           'phenotype', 'relation', 'source', 'evidence', 'dbxref']
data = [
    ['ClinVarVariant:254143', 'C326F', 'SO:0000694',
     'HP:0000504', 'RO:0002200', 'PMID:12503095', 'ECO:0000220',
     'dbSNP:886037891']
]
# Initialize graph and model
graph = RDFGraph()
model = Model(graph)
# Build the dataframe
dataframe = pd.DataFrame(data=data, columns=columns)

for index, row in dataframe.iterrows():
    # Add the triple ClinVarVariant:254143 RO:0002200 HP:0000504
    # RO:0002200 is the has_phenotype relation
    model.addTriple(row['variant'], row['relation'], row['phenotype'])
    # The addLabel method adds a label using the rdfs:label relation
    model.addLabel(row['variant'], row['variant_label'])
    # addType makes the variant an individual of a class,
    # in this case SO:0000694 'SNP'
    model.addType(row['variant'], row['variant_type'])
    # addXref uses the relation OIO:hasDbXref
    model.addXref(row['variant'], row['dbxref'])
# Serialize the graph as turtle
print(graph.serialize(format='turtle').decode("utf-8"))
For more information see the Working with the models API section.
Notebooks¶
Running jupyter locally¶
Follow the instructions for installing from GitHub in Installation. Then start a notebook browser with:
pip install jupyter
PYTHONPATH=. jupyter notebook ./docs/notebooks
Downloads¶
RDF¶
The dipper output is quality checked and released on a regular basis. The latest release can be found here:
The output from our development branch is made available here (may contain errors):
TSV¶
TSV downloads for common queries can be found here:
Neo4J¶
A dump of our Neo4J database that includes the output from dipper plus various ontologies:
A public version can be accessed via the SciGraph REST API:
Ingest status¶
We use Jenkins to periodically build each source. A dashboard containing the current status of each ingest can be found here:
Deeper into Dipper¶
A look into the structure of the codebase and how to write ingests
Working with graphs¶
The Dipper graph package provides two graph implementations: RDFGraph, an extension of the RDFLib [1] Graph [2], and StreamedGraph, which prints triples to standard out in the N-Triples format.
RDFGraphs¶
The RDFGraph class reads the curie_map.yaml file and converts strings formatted as curies to RDFLib URIRefs. Triples are added via the addTriple method, for example:
from dipper.graph.RDFGraph import RDFGraph
graph = RDFGraph()
graph.addTriple('foaf:John', 'foaf:knows', 'foaf:Joseph')
The graph can then be serialized in a variety of formats using RDFLib [3]:
from dipper.graph.RDFGraph import RDFGraph
graph = RDFGraph()
graph.addTriple('foaf:John', 'foaf:knows', 'foaf:Joseph')
print(graph.serialize(format='turtle').decode("utf-8"))
# Or write to file
graph.serialize(destination="/path/to/output.ttl", format='turtle')
Prints:
@prefix OBO: <http://purl.obolibrary.org/obo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
foaf:John foaf:knows foaf:Joseph .
When an object is a literal, set the object_is_literal parameter to True:
from dipper.graph.RDFGraph import RDFGraph
graph = RDFGraph()
graph.addTriple('foaf:John', 'rdfs:label', 'John', object_is_literal=True)
Literal types can also be passed into the method:
from dipper.graph.RDFGraph import RDFGraph
graph = RDFGraph()
graph.addTriple(
'foaf:John', 'foaf:age', 12,
object_is_literal=True, literal_type="xsd:integer"
)
StreamedGraphs¶
StreamedGraphs print triples as they are processed by the addTriple method. This is useful for large sources where building the full graph in memory would be costly. The output should be sorted and uniquified, as there is no checking for duplicate triples. For example:
from dipper.graph.StreamedGraph import StreamedGraph
graph = StreamedGraph()
graph.addTriple('foaf:John', 'foaf:knows', 'foaf:Joseph')
Prints:
<http://xmlns.com/foaf/0.1/John> <http://xmlns.com/foaf/0.1/knows> <http://xmlns.com/foaf/0.1/Joseph> .
References¶
[1] RDFLib: http://rdflib.readthedocs.io/en/stable/
[2] RDFLib Graphs: https://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#graph-module
[3] RDFLib Serializing: http://rdflib.readthedocs.io/en/stable/apidocs/rdflib.html#rdflib.graph.Graph.serialize
Working with the models API¶
The models package provides classes for building common sets of triples based on our modeling patterns.
For an example see the notebook on this topic: Building graphs with the model API
Basics¶
The Model class provides methods for building common RDF and OWL statements.
For a list of methods, see the API docs.
Building associations¶
We use the RDF Reification [1] pattern to create ternary statements, for example, adding frequency data to phenotype-to-disease associations. We utilize the Open Biomedical Association (OBAN) ontology [2] to reify statements, and the SEPIO ontology [3] to add evidence and provenance.
For a list of classes and methods, see the API docs.
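As an illustration only, the following sketch builds a single disease-to-phenotype association with the D2PAssoc class described in the API docs; the curies are example placeholders, and 'test_source' stands in for the curie or name of the ingesting resource:
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.assoc.D2PAssoc import D2PAssoc

graph = RDFGraph()

# example identifiers; definedby names the resource making the claim
# onset and frequency may optionally be supplied to the constructor
assoc = D2PAssoc(graph, 'test_source', 'OMIM:154700', 'HP:0001166')
assoc.add_evidence('ECO:0000220')   # evidence code, kept as a list on the object
assoc.add_source('PMID:12503095')   # publication supporting the claim
assoc.add_association_to_graph()    # reifies the statement OBAN-style

print(graph.serialize(format='turtle').decode("utf-8"))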
Building genotypes¶
We use the GENO ontology [4] to build complex genotypes and their parts.
For a list of methods, see the API docs.
GENO docs: The Genotype Ontology (GENO)
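A minimal sketch of building a genotype and one of its parts with the Genotype class documented below; the curies and labels are hypothetical examples:
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Genotype import Genotype

graph = RDFGraph()
geno = Genotype(graph)

# hypothetical identifiers for a genotype, a variant locus (allele), and its gene
geno.addGenotype(':genotype1', 'example genotype')
geno.addGene(':gene1', 'example gene')
geno.addAllele(':allele1', 'example allele')
geno.addAlleleOfGene(':allele1', ':gene1')
# attach the allele to the genotype as a part
geno.addParts(':allele1', ':genotype1')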
Building complex evidence and provenance graphs¶
We use the SEPIO ontology to build complex evidence and provenance. For an example see the IMPC source ingest.
For a list of methods, see the API docs for evidence and provenance.
SEPIO docs: The Scientific Evidence and Provenance Information Ontology (SEPIO)
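A minimal sketch using the Evidence and Provenance classes documented below; the identifiers are hypothetical, and ':assoc1' is assumed to be an association node created elsewhere:
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Evidence import Evidence
from dipper.models.Provenance import Provenance

graph = RDFGraph()

# ':assoc1' is a hypothetical association node created elsewhere
evidence = Evidence(graph, ':assoc1')
evidence.add_supporting_evidence(':evidenceline1', 'ECO:0000220')
evidence.add_supporting_publication(':evidenceline1', 'PMID:12503095')

provenance = Provenance(graph)
provenance.add_agent_to_graph(':agent1', 'example agent')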
References¶
[1] RDF Reification: https://www.w3.org/TR/rdf-primer/#reification
[2] OBAN: https://github.com/EBISPOT/OBAN
[3] SEPIO: https://github.com/monarch-initiative/SEPIO-ontology
[4] GENO: https://github.com/monarch-initiative/GENO-ontology
Writing ingests with the source API¶
Overview¶
Although it is not required for writing an ingest, we provide a Source parent class that can be extended to leverage reusable functionality in each ingest.
To create a new ingest using this method, first extend the Source class.
If the source contains flat files, include a files dictionary with this structure:
files = {
    'somekey': {
        'file': 'filename.tsv',
        'url': 'http://example.org/filename.tsv'
    },
    ...
}
For example:
from dipper.sources.Source import Source

class TPO(Source):
    """
    The ToxicoPhenomicOmicsDB contains data on ...
    """
    files = {
        'genes': {
            'file': 'genes.tsv',
            'url': 'http://example.org/genes.tsv'
        }
    }
Initializing the class¶
Each source class takes graph_type (string) and are_bnodes_skolemized (boolean) parameters, which are used to initialize a graph object in the Source constructor.
Note: In the future this may be adjusted so that a graph object is passed into each source.
For example:
def __init__(self, graph_type, are_bnodes_skolemized):
    super().__init__(graph_type, are_bnodes_skolemized, 'TPO')
Writing the fetcher¶
This method is intended to fetch data from the remote locations (only if it is newer than the local copy).
Extend the parent fetch method: if the remote file has already been downloaded, it checks the remote headers to see whether the file has been updated. For sources not served over HTTP, this method may need to be overridden, for example in Bgee.
For example:
def fetch(self, is_dl_forced=False):
    """
    Fetches files from TPO
    :param is_dl_forced (bool): Force download
    :return None
    """
    self.get_files(is_dl_forced)
Writing the parser¶
Typically these are written by looping through the series of files that were obtained by the fetch method. The goal is to process each file minimally, adding classes and individuals as necessary, and adding triples to the source's graph.
For example:
def parse(self, limit=None):
    """
    Parses genes from TPO
    :param limit (int, optional) limit the number of rows processed
    :return None
    """
    if limit is not None:
        logger.info("Only parsing first %d rows", limit)

    # Open file
    fh = open('/'.join((self.rawdir, self.files['genes']['file'])), 'r')

    # Parse file
    self._add_gene_toxicology(fh, limit)

    # Close file
    fh.close()
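The _add_gene_toxicology helper above is hypothetical; a minimal sketch of such a row-processing method on the TPO class, assuming a tab-delimited file with gene id, gene label, and phenotype columns, might look like:
import csv

from dipper.models.Model import Model

def _add_gene_toxicology(self, fh, limit=None):
    """
    Sketch of a row-processing helper (assumed column layout)
    """
    model = Model(self.graph)
    reader = csv.reader(fh, delimiter='\t')
    for row_num, row in enumerate(reader):
        (gene_id, gene_label, phenotype_id) = row  # assumed columns
        # genes are added as OWL classes (see the conventions below)
        model.addClassToGraph(gene_id, gene_label)
        # RO:0002200 is the has_phenotype relation
        model.addTriple(gene_id, 'RO:0002200', phenotype_id)
        if limit is not None and row_num + 1 >= limit:
            break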
Considerations when writing a parser¶
There are certain conventions that we follow when parsing data:
1. Genes are a special case of genomic feature and are added as (OWL) classes, while all other genomic features are added as individuals of an OWL class (see the sketch after this list).
2. If a source references an external identifier, assume that it has been processed in another source script, and only add the identifier (but not the label) within this source's file. This helps prevent label collisions related to slightly different versions of the source data when integrating downstream.
3. You can instantiate a class or individual as many times as you want; they will get merged in the graph and will only show up once in the resulting output.
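The sketch below illustrates conventions 1 and 2 with the Model API; the curies are hypothetical examples:
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model

graph = RDFGraph()
model = Model(graph)

# 1. a gene is added as an OWL class ...
model.addClassToGraph('NCBIGene:12345', 'example gene')
# ... while another genomic feature is an individual of a class
# (SO:0001059 is sequence_alteration)
model.addIndividualToGraph(':variant1', 'example variant', 'SO:0001059')

# 2. reference an external identifier without asserting its label
model.addXref(':variant1', 'ClinVarVariant:254143')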
Testing ingests¶
Unit tests¶
Unit-style tests can be achieved by mocking source classes (or specific functions) and testing single functions. The test_graph_equality function can be used to test graph equality by supplying a string formatted as headless (no prefixes) turtle and a graph object. Most dipper methods are not pure functions and rely on side effects on a graph object, so it is best to reset the graph object before any testing logic, e.g.:
from dipper.graph.RDFGraph import RDFGraph
from dipper.utils.TestUtils import TestUtils

source.graph = RDFGraph(True)  # Reset graph
test_util = TestUtils()
source.run_some_function()
expected_triples = """
foaf:person1 foaf:knows foaf:person2 .
"""
self.assertTrue(test_util.test_graph_equality(
    expected_triples, source.graph))
Integration tests¶
Integration tests can be executed by generating a file that contains a subset of a source’s data in the same format, and running it through the source.parse() method, serializing the graph, and then testing this file in some other piece of code or database.
You may see testing code within source classes, but these tests will be deleted or refactored and moved to the test directory.
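For example, a hypothetical integration-style check for the TPO example above might look like the following; the class, paths, and the assumption that 'rdf_graph' selects the in-memory RDFGraph are all placeholders:
from dipper.sources.TPO import TPO  # the hypothetical source sketched above

source = TPO('rdf_graph', True)
source.rawdir = 'tests/resources/tpo'   # directory holding the data subset
source.parse(limit=100)
source.graph.serialize(destination='/tmp/tpo_test.ttl', format='turtle')
# /tmp/tpo_test.ttl can now be loaded and checked by downstream code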
Configuring dipper with keys and passwords¶
Add private configuration parameters into your private conf.json file. Examples of items to put into the config include:
- database connection parameters (in the “dbauth” object)
- ftp login credentials
- api keys (in the “keys” object)
These are organized such that within any object (dbauth, keys, etc), they are keyed again by the source’s name.
Here is an example:
{
    "keys": {
        "omim": "foo"
    },
    "dbauth": {
        "mgi": {
            "user": "bar",
            "password": "baz"
        }
    }
}
This file must be placed in the dipper package directory and named conf.json. If building locally, this is the dipper/dipper/ directory; if installed with pip, it is the path/to/env/lib/python3.x/site-packages/dipper/ directory.
Schemas¶
Although RDF is inherently schemaless, we aim to construct consistent models across sources. This allows us to build source agnostic queries and bridge data across sources.
The dipper schemas are documented as directed graphs. Examples can be found in the ingest artifacts repo.
Some ontologies contain documentation on how to describe data using the classes and properties defined in the ontology:
While not yet implemented, in the future we plan on defining our schemas and constraints using the BioLink model specification.
The cypher queries that we use to cache inferred and direct relationships between entities are stored in GitHub.
For developers¶
API Docs¶
dipper package¶
Subpackages¶
dipper.graph package¶
-
class dipper.graph.RDFGraph.RDFGraph(are_bnodes_skized=True, identifier=None)¶
Bases: rdflib.graph.ConjunctiveGraph, dipper.graph.Graph.Graph
Extends RDFLib's ConjunctiveGraph. The goal of this class is to wrap the creation of triples and manage the creation of URIRefs, BNodes, and Literals from an input curie.
-
addTriple
(subject_id, predicate_id, obj, object_is_literal=False, literal_type=None)¶
-
bind_all_namespaces
()¶
-
curie_util
= <dipper.utils.CurieUtil.CurieUtil object>¶
-
skolemizeBlankNode
(curie)¶
-
-
class dipper.graph.StreamedGraph.StreamedGraph(are_bnodes_skized=True, file_handle=None, fmt='nt')¶
Bases: dipper.graph.Graph.Graph
Stream RDF triples to file or stdout. Assumes a downstream process will sort then uniquify triples.
Theoretically could support both ntriple and rdfxml formats; for now just supports nt.
-
addTriple
(subject_id, predicate_id, object_id, object_is_literal=False, literal_type=None)¶
-
curie_util
= <dipper.utils.CurieUtil.CurieUtil object>¶
-
serialize
(subject_iri, predicate_iri, obj, object_is_literal=False, literal_type=None)¶
-
skolemizeBlankNode
(curie)¶
-
dipper.models package¶
-
class dipper.models.assoc.Association.Assoc(graph, definedby, sub=None, obj=None, pred=None)¶
Bases: object
A base class for OBAN (Monarch)-style associations, to enable attribution of source and evidence on statements.
-
add_association_to_graph
()¶
-
add_date
(date)¶
-
add_evidence
(identifier)¶ Add an evidence code to the association object (maintained as a list) :param identifier:
Returns:
-
add_predicate_object
(predicate, object_node, object_type=None, datatype=None)¶
-
add_provenance
(identifier)¶
-
add_source
(identifier)¶ Add a source identifier (such as publication id) to the association object (maintained as a list) TODO we need to greatly expand this function!
Parameters: identifier – Returns:
-
annotation_properties
= {'consider': 'OIO:consider', 'definition': 'IAO:0000115', 'hasExactSynonym': 'OIO:hasExactSynonym', 'hasRelatedSynonym': 'OIO:hasRelatedSynonym', 'has_xref': 'OIO:hasDbXref', 'inchi_key': 'CHEBI:InChIKey', 'probabalistic_quantifier': 'GENO:0000867', 'replaced_by': 'IAO:0100001'}¶
-
assoc_types
= {'association': 'OBAN:association'}¶
-
datatype_properties
= {'created_on': 'pav:createdOn', 'has_measurement': 'IAO:0000004', 'has_quantifier': 'GENO:0000866', 'position': 'faldo:position'}¶
-
get_association_id
()¶
-
get_properties
()¶
-
static
make_association_id
(definedby, subject, predicate, object, attributes=None)¶ A method to create unique identifiers for OBAN-style associations, based on all the parts of the association. If any of the items is empty or None, it will convert it to blank. It effectively digests the string of concatenated values. Subclasses of Assoc can submit an additional array of attributes that will be appended to the ID.
Parameters: - definedby – The (data) resource that provided the annotation
- subject –
- predicate –
- object –
- attributes –
Returns:
-
object_properties
= {'causes_or_contributes': 'RO:0003302', 'expressed_in': 'RO:0002206', 'has disposition': 'RO:0000091', 'has_evidence': 'RO:0002558', 'has_object': 'OBAN:association_has_object', 'has_phenotype': 'RO:0002200', 'has_predicate': 'OBAN:association_has_predicate', 'has_provenance': 'OBAN:has_provenance', 'has_quality': 'RO:0000086', 'has_source': 'dc:source', 'has_subject': 'OBAN:association_has_subject', 'in_taxon': 'RO:0002162', 'is_about': 'IAO:0000136', 'towards': 'RO:0002503'}¶
-
properties
= {'causes_or_contributes': 'RO:0003302', 'consider': 'OIO:consider', 'created_on': 'pav:createdOn', 'definition': 'IAO:0000115', 'expressed_in': 'RO:0002206', 'has disposition': 'RO:0000091', 'hasExactSynonym': 'OIO:hasExactSynonym', 'hasRelatedSynonym': 'OIO:hasRelatedSynonym', 'has_evidence': 'RO:0002558', 'has_measurement': 'IAO:0000004', 'has_object': 'OBAN:association_has_object', 'has_phenotype': 'RO:0002200', 'has_predicate': 'OBAN:association_has_predicate', 'has_provenance': 'OBAN:has_provenance', 'has_quality': 'RO:0000086', 'has_quantifier': 'GENO:0000866', 'has_source': 'dc:source', 'has_subject': 'OBAN:association_has_subject', 'has_xref': 'OIO:hasDbXref', 'in_taxon': 'RO:0002162', 'inchi_key': 'CHEBI:InChIKey', 'is_about': 'IAO:0000136', 'position': 'faldo:position', 'probabalistic_quantifier': 'GENO:0000867', 'replaced_by': 'IAO:0100001', 'towards': 'RO:0002503'}¶
-
set_association_id
(assoc_id=None)¶ This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.
Parameters: assoc_id – Returns:
-
set_description
(description)¶
-
set_object
(identifier)¶
-
set_relationship
(identifier)¶
-
set_score
(score, unit=None, score_type=None)¶
-
set_subject
(identifier)¶
-
-
class dipper.models.assoc.Chem2DiseaseAssoc.Chem2DiseaseAssoc(graph, definedby, chem_id, phenotype_id, rel_id=None)¶
Bases: dipper.models.assoc.Association.Assoc
Attributes:
- assoc_id (str): Association Curie (Prefix:ID)
- chem_id (str): Chemical Curie
- phenotype_id (str): Phenotype Curie
- pub_list (str, list): One or more publication curies
- rel (str): Property relating assoc_id and chem_id
- evidence (str): Evidence curie
-
make_c2p_assoc_id
()¶
-
set_association_id
(assoc_id=None)¶ This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.
Parameters: assoc_id – Returns:
-
-
class dipper.models.assoc.D2PAssoc.D2PAssoc(graph, definedby, disease_id, phenotype_id, onset=None, frequency=None, rel=None)¶
Bases: dipper.models.assoc.Association.Assoc
A specific association class for defining Disease-to-Phenotype relationships. This assumes that a graph is created outside of this class, and nodes get added. By default, an association will assume the "has_phenotype" relationship, unless otherwise specified.
-
add_association_to_graph
()¶ The reified relationship between a disease and a phenotype is decorated with some provenance information. This makes the assumption that both the disease and phenotype are classes.
Parameters: g – Returns:
-
d2p_object_properties
= {'frequency': ':frequencyOfPhenotype', 'onset': ':onset'}¶
-
make_d2p_id
()¶ Make an association id for phenotypic associations with disease that is defined by: source of association + disease + relationship + phenotype + onset + frequency
Returns:
-
set_association_id
(assoc_id=None)¶ This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.
Parameters: assoc_id – Returns:
-
-
class dipper.models.assoc.DispositionAssoc.DispositionAssoc(graph, definedby, entity_id, heritability_id)¶
Bases: dipper.models.assoc.Association.Assoc
A specific Association model for Heritability annotations. These are to be used between diseases and a heritability disposition.
-
class dipper.models.assoc.G2PAssoc.G2PAssoc(graph, definedby, entity_id, phenotype_id, rel=None)¶
Bases: dipper.models.assoc.Association.Assoc
A specific association class for defining Genotype-to-Phenotype relationships. This assumes that a graph is created outside of this class, and nodes get added. By default, an association will assume the "has_phenotype" relationship, unless otherwise specified. Note that genotypes are expected to be created and defined outside of this association, most likely by calling methods in the Genotype() class.
-
add_association_to_graph
()¶ Overrides Association by including bnode support
The reified relationship between a genotype (or any genotype part) and a phenotype is decorated with some provenance information. This makes the assumption that both the genotype and phenotype are classes.
currently hardcoded to map the annotation to the monarch namespace :param g: :return:
-
g2p_types
= {'developmental_process': 'GO:0032502'}¶
-
make_g2p_id
()¶ Make an association id for phenotypic associations that is defined by: source of association + (Annot subject) + relationship + phenotype/disease + environment + start stage + end stage
Returns:
-
set_association_id
(assoc_id=None)¶ This will set the association ID based on the internal parts of the association. To be used in cases where an external association identifier should be used.
Parameters: assoc_id – Returns:
-
set_environment
(environment_id)¶
-
set_stage
(start_stage_id, end_stage_id)¶
-
-
class dipper.models.assoc.InteractionAssoc.InteractionAssoc(graph, definedby, subj, obj, rel=None)¶
Bases: dipper.models.assoc.Association.Assoc
-
interaction_object_properties
= {'colocalizes_with': 'RO:0002325', 'genetically_interacts_with': 'RO:0002435', 'interacts_with': 'RO:0002434', 'molecularly_interacts_with': 'RO:0002436', 'negatively_regulates': 'RO:0003002', 'positively_regulates': 'RO:0003003', 'regulates': 'RO:0002448', 'ubiquitinates': 'RO:0002480'}¶
-
-
class dipper.models.assoc.OrthologyAssoc.OrthologyAssoc(graph, definedby, gene1, gene2, rel=None)¶
Bases: dipper.models.assoc.Association.Assoc
-
add_gene_family_to_graph
(family_id)¶ Make an association between a group of genes and some grouping class. We make the assumption that the genes in the association are part of the supplied family_id, and that the genes have already been declared as classes elsewhere. The family_id is added as an individual of type DATA:gene_family.
Triples: <family_id> a DATA:gene_family <family_id> RO:has_member <gene1> <family_id> RO:has_member <gene2>
Parameters: - family_id –
- g – the graph to modify
Returns:
-
ortho_rel
= {'has_member': 'RO:0002351', 'homologous': 'RO:HOM0000019', 'in_paralogous': 'RO:HOM0000023', 'least_diverged_orthologous': 'RO:HOM0000020', 'ohnologous': 'RO:HOM0000022', 'orthologous': 'RO:HOM0000017', 'paralogous': 'RO:HOM0000011', 'xenologous': 'RO:HOM0000018'}¶
-
terms
= {'gene_family': 'DATA:3148'}¶
-
-
class dipper.models.Dataset.Dataset(identifier, title, url, description=None, license_url=None, data_rights=None, graph_type=None, file_handle=None)¶
Bases: object
This will produce the metadata about a dataset following the example laid out here: http://htmlpreview.github.io/?https://github.com/joejimbo/HCLSDatasetDescriptions/blob/master/Overview.html#appendix_1 (mind the wrap)
-
getGraph
()¶
-
get_license
()¶
-
setFileAccessUrl
(url, is_object_literal=False)¶
-
setVersion
(date_issued, version_id=None)¶ - Legacy function…
- should use the other set_* for version and date
as of 2016-10-20 used in:
dipper/sources/HPOAnnotations.py 139: dipper/sources/CTD.py 99: dipper/sources/BioGrid.py 100: dipper/sources/MGI.py 255: dipper/sources/EOM.py 93: dipper/sources/Coriell.py 200: dipper/sources/MMRRC.py 77:
# TODO set as deprecated
Parameters: - date_issued –
- version_id –
Returns:
-
set_citation
(citation_id)¶
-
set_date_issued
(date_issued)¶
-
set_license
(license)¶
-
set_version_by_date
(date_issued=None)¶ This will set the version by the date supplied, the date already stored in the dataset description, or by the download date (today) :param date_issued: :return:
-
set_version_by_num
(version_num)¶
-
-
class dipper.models.Environment.Environment(graph)¶
Bases: object
These methods provide convenient ways to add items related to an experimental environment and its parts to a supplied graph.
This is a stub ready for expansion.
-
addComponentAttributes
(component_id, entity_id, value=None, unit=None)¶
-
addComponentToEnvironment
(env_id, component_id)¶
-
addEnvironment
(env_id, env_label, env_type=None, env_description=None)¶
-
addEnvironmentalCondition
(cond_id, cond_label, cond_type=None, cond_description=None)¶
-
annotation_properties
= {}¶
-
environment_parts
= {'crispr_reagent': 'REO:crispr_TBD', 'environmental_condition': 'XCO:0000000', 'environmental_system': 'ENVO:01000254', 'morpholio_reagent': 'REO:0000042', 'talen_reagent': 'REO:0001022'}¶
-
object_properties
= {'has_part': 'BFO:0000051'}¶
-
properties
= {'has_part': 'BFO:0000051'}¶
-
-
class dipper.models.Evidence.Evidence(graph, association)¶
Bases: object
To model evidence as the basis for an association. This encompasses:
- measurements taken from the lab, and their significance; these can be derived from papers or other agents
- papers
- >1 measurement may result from an assay, each of which may have its own significance
-
add_data_individual
(data_curie, label=None, ind_type=None)¶ Add data individual :param data_curie: str either curie formatted or long string,
long strings will be converted to bnodesParameters: - type – str curie
- label – str
Returns: None
-
add_evidence
(evidence_line, ev_type=None, label=None)¶ Add line of evidence node to association id
Parameters: - assoc_id – curie or iri, association id
- evidence_line – curie or iri, evidence line
Returns: None
-
add_source
(evidence_line, source, label=None, src_type=None)¶ Applies the triples: <evidence> <dc:source> <source> <source> <rdf:type> <type> <source> <rdfs:label> “label”
TODO this should belong in a higher level class :param evidence_line: str curie :param source: str source as curie :param label: optional, str type as curie :param type: optional, str type as curie :return: None
-
add_supporting_data
(evidence_line, measurement_dict)¶ Add supporting data :param evidence_line: :param data_object: dict, where keys are curies or iris and values are measurement values for example:
- {
- “_:1234” : “1.53E07” “_:4567”: “20.25”
}
Note: assumes measurements are RDF:Type ‘ed elsewhere :return: None
-
add_supporting_evidence
(evidence_line, type=None, label=None)¶ Add supporting line of evidence node to association id
Parameters: - assoc_id – curie or iri, association id
- evidence_line – curie or iri, evidence line
Returns: None
-
add_supporting_publication
(evidence_line, publication, label=None, pub_type=None)¶ <evidence> <SEPIO:0000124> <source> <source> <rdf:type> <type> <source> <rdfs:label> “label” :param evidence_line: str curie :param publication: str curie :param label: optional, str type as curie :param type: optional, str type as curie :return:
-
data_property
= {'has_measurement': 'IAO:0000004', 'has_value': 'STATO:0000129'}¶
-
data_types
= {'count': 'SIO:000794', 'odds_ratio': 'STATO:0000182', 'proportional_reporting_ratio': 'OAE:0001563'}¶
-
evidence_types
= {'assay': 'OBI:0000070', 'blood test evidence': 'ECO:0001016', 'effect_size': 'STATO:0000085', 'fold_change': 'STATO:0000169', 'measurement datum': 'IAO:0000109', 'percent_change': 'STATO:percent_change', 'pvalue': 'OBI:0000175', 'statistical_hypothesis_test': 'OBI:0000673', 'zscore': 'STATO:0000104'}¶
-
object_properties
= {'has_evidence': 'SEPIO:0000006', 'has_significance': 'STATO:has_significance', 'has_supporting_data': 'SEPIO:0000084', 'has_supporting_evidence': 'SEPIO:0000007', 'has_supporting_reference': 'SEPIO:0000124', 'is_evidence_for': 'SEPIO:0000031', 'is_evidence_supported_by': 'SEPIO:000010', 'is_evidence_with_support_from': 'SEPIO:0000059', 'is_refuting_evidence_for': 'SEPIO:0000033', 'is_supporting_evidence_for': 'SEPIO:0000032', 'source': 'dc:source'}¶
-
class dipper.models.Family.Family(graph)¶
Bases: object
Model mereological (part-whole) relationships.
Although these relations are more abstract, we often use them to model family relationships (proteins, humans, etc.). The naming of this class may change in the future to better reflect the meaning of the relations it is modeling.
-
addMember
(group_id, member_id)¶
-
addMemberOf
(member_id, group_id)¶
-
object_properties
= {'has_member': 'RO:0002351', 'member_of': 'RO:0002350'}¶
-
-
class dipper.models.GenomicFeature.Feature(graph, feature_id=None, label=None, feature_type=None, description=None)¶
Bases: object
Dealing with genomic features here. By default they are all faldo:Regions. We use SO for typing genomic features. At the moment, RO:has_subsequence is the default relationship between the regions, but this should be tested/verified.
TODO: the graph additions are in the addXToFeature functions, but should be separated. TODO: this will need to be extended to properly deal with fuzzy positions in faldo.
-
addFeatureEndLocation
(coordinate, reference_id, strand=None, position_types=None)¶ Adds the coordinate details for the end of this feature :param coordinate: :param reference_id: :param strand:
Returns:
-
addFeatureProperty
(property_type, property)¶
-
addFeatureStartLocation
(coordinate, reference_id, strand=None, position_types=None)¶ Adds coordinate details for the start of this feature. :param coordinate: :param reference_id: :param strand: :param position_types:
Returns:
-
addFeatureToGraph
(add_region=True, region_id=None, feature_as_class=False)¶ We make the assumption here that all features are instances. The features are located on a region, which begins and ends with faldo:Position The feature locations leverage the Faldo model, which has a general structure like: Triples: feature_id a feature_type (individual) faldo:location region_id region_id a faldo:region faldo:begin start_position faldo:end end_position start_position a (any of: faldo:(((Both|Plus|Minus)Strand)|Exact)Position) faldo:position Integer(numeric position) faldo:reference reference_id end_position a (any of: faldo:(((Both|Plus|Minus)Strand)|Exact)Position) faldo:position Integer(numeric position) faldo:reference reference_id
Parameters: graph – Returns:
-
addPositionToGraph
(reference_id, position, position_types=None, strand=None)¶ Add the positional information to the graph, following the faldo model. We assume that if the strand is None, we give it a generic “Position” only. Triples: my_position a (any of: faldo:(((Both|Plus|Minus)Strand)|Exact)Position) faldo:position Integer(numeric position) faldo:reference reference_id
Parameters: - graph –
- reference_id –
- position –
- position_types –
- strand –
Returns: Identifier of the position created
-
addRegionPositionToGraph
(region_id, begin_position_id, end_position_id)¶
-
addSubsequenceOfFeature
(parentid)¶ This will add reciprocal triples like: feature is_subsequence_of parent parent has_subsequence feature :param graph: :param parentid:
Returns:
-
addTaxonToFeature
(taxonid)¶ Given the taxon id, this will add the following triple: feature in_taxon taxonid :param graph: :param taxonid: :return:
-
annotation_properties
= {}¶
-
data_properties
= {'position': 'faldo:position'}¶
-
object_properties
= {'begin': 'faldo:begin', 'downstream_of_sequence_of': 'RO:0002529', 'end': 'faldo:end', 'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'has_staining_intensity': 'GENO:0000207', 'has_subsequence': 'RO:0002524', 'is_about': 'IAO:0000136', 'is_subsequence_of': 'RO:0002525', 'location': 'faldo:location', 'reference': 'faldo:reference', 'upstream_of_sequence_of': 'RO:0002528'}¶
-
properties
= {'begin': 'faldo:begin', 'downstream_of_sequence_of': 'RO:0002529', 'end': 'faldo:end', 'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'has_staining_intensity': 'GENO:0000207', 'has_subsequence': 'RO:0002524', 'is_about': 'IAO:0000136', 'is_subsequence_of': 'RO:0002525', 'location': 'faldo:location', 'position': 'faldo:position', 'reference': 'faldo:reference', 'upstream_of_sequence_of': 'RO:0002528'}¶
-
types
= {'FuzzyPosition': 'faldo:FuzzyPosition', 'Position': 'faldo:Position', 'SNP': 'SO:0000694', 'assembly_component': 'SO:0000143', 'band_intensity': 'GENO:0000618', 'both_strand': 'faldo:BothStrandPosition', 'centromere': 'SO:0000577', 'chromosome': 'SO:0000340', 'chromosome_arm': 'SO:0000105', 'chromosome_band': 'SO:0000341', 'chromosome_part': 'SO:0000830', 'chromosome_region': 'GENO:0000614', 'chromosome_subband': 'GENO:0000616', 'genome': 'SO:0001026', 'gneg': 'GENO:0000620', 'gpos': 'GENO:0000619', 'gpos100': 'GENO:0000622', 'gpos25': 'GENO:0000625', 'gpos33': 'GENO:0000633', 'gpos50': 'GENO:0000624', 'gpos66': 'GENO:0000632', 'gpos75': 'GENO:0000623', 'gvar': 'GENO:0000621', 'haplotype': 'GENO:0000871', 'long_chromosome_arm': 'GENO:0000629', 'minus_strand': 'faldo:MinusStrandPosition', 'plus_strand': 'faldo:PlusStrandPosition', 'reference_genome': 'SO:0001505', 'region': 'faldo:Region', 'score': 'SO:0001685', 'short_chromosome_arm': 'GENO:0000628'}¶
-
-
dipper.models.GenomicFeature.makeChromID(chrom, reference=None, prefix=None)¶
This will take a chromosome number and a NCBI taxon number, and create a unique identifier for the chromosome. These identifiers are made in the @base space like: Homo sapiens (9606) chr1 ==> :9606chr1; Mus musculus (10090) chrX ==> :10090chrX
Parameters: - chrom – the chromosome (preferably without any chr prefix)
- reference – the numeric portion of the taxon id
Returns:
-
dipper.models.GenomicFeature.makeChromLabel(chrom, reference=None)¶
-
class dipper.models.Genotype.Genotype(graph)¶
Bases: object
These methods provide convenient ways to add items related to a genotype and its parts to a supplied graph. They follow the patterns set out in GENO https://github.com/monarch-initiative/GENO-ontology. For specific sequence features, we use the GenomicFeature class to create them.
-
addAffectedLocus
(allele_id, gene_id, rel_id=None)¶ We make the assumption here that if the relationship is not provided, it is a GENO:is_sequence_variant_instance_of.
Here, the allele should be a variant_locus, not a sequence alteration. :param allele_id: :param gene_id: :param rel_id: :return:
-
addAllele
(allele_id, allele_label, allele_type=None, allele_description=None)¶ Make an allele object. If no allele_type is added, it will default to a geno:allele :param allele_id: curie for allele (required) :param allele_label: label for allele (required) :param allele_type: id for an allele type (optional, recommended SO or GENO class) :param allele_description: a free-text description of the allele :return:
-
addAlleleOfGene
(allele_id, gene_id, rel_id=None)¶ We make the assumption here that if the relationship is not provided, it is a GENO:is_sequence_variant_instance_of.
Here, the allele should be a variant_locus, not a sequence alteration. :param allele_id: :param gene_id: :param rel_id: :return:
-
addChromosome
(chr, tax_id, tax_label=None, build_id=None, build_label=None)¶ If it is just the chromosome, add it as an instance of a SO:chromosome, and add it to the genome. If a build is included, pun the chromosome as a subclass of SO:chromosome, and make the build-specific chromosome an instance of the supplied chr. The chr then becomes part of the build or genome.
-
addChromosomeClass
(chrom_num, taxon_id, taxon_label)¶
-
addChromosomeInstance
(chr_num, reference_id, reference_label, chr_type=None)¶ Add the supplied chromosome as an instance within the given reference :param chr_num: :param reference_id: for example, a build id like UCSC:hg19 :param reference_label: :param chr_type: this is the class that this is an instance of. typically a genome-specific chr
Returns:
-
addConstruct
(construct_id, construct_label, construct_type=None, construct_description=None)¶
-
addDerivesFrom
(child_id, parent_id)¶ We add a derives_from relationship between the child and parent id. Examples of uses include between: an allele and a construct or strain here, a cell line and it’s parent genotype. Adding the parent and child to the graph should happen outside of this function call to ensure graph integrity. :param child_id: :param parent_id: :return:
-
addGene
(gene_id, gene_label, gene_type=None, gene_description=None)¶
-
addGeneProduct
(sequence_id, product_id, product_label=None, product_type=None)¶ Add gene/variant/allele has_gene_product relationship Can be used to either describe a gene to transcript relationship or gene to protein :param sequence_id: :param product_id: :param product_label: :param product_type: :return:
-
addGeneTargetingReagent
(reagent_id, reagent_label, reagent_type, gene_id, description=None)¶ Here, a gene-targeting reagent is added. The actual targets of this reagent should be added separately. :param reagent_id: :param reagent_label: :param reagent_type:
Returns:
-
addGeneTargetingReagentToGenotype
(reagent_id, genotype_id)¶
-
addGenome
(taxon_id, taxon_label=None)¶
-
addGenomicBackground
(background_id, background_label, background_type=None, background_description=None)¶
-
addGenomicBackgroundToGenotype
(background_id, genotype_id, background_type=None)¶
-
addGenotype
(genotype_id, genotype_label, genotype_type=None, genotype_description=None)¶ If a genotype_type is not supplied, we will default to ‘intrinsic_genotype’ :param genotype_id: :param genotype_label: :param genotype_type: :param genotype_description: :return:
-
addMemberOfPopulation
(member_id, population_id)¶
-
addParts
(part_id, parent_id, part_relationship=None)¶ This will add a has_part (or subproperty) relationship between a parent_id and the supplied part. By default the relationship will be BFO:has_part, but any relationship could be given here. :param part_id: :param parent_id: :param part_relationship: :return:
-
addPartsToVSLC
(vslc_id, allele1_id, allele2_id, zygosity_id=None, allele1_rel=None, allele2_rel=None)¶ Here we add the parts to the VSLC. While traditionally alleles (reference or variant loci) are traditionally added, you can add any node (such as sequence_alterations for unlocated variations) to a vslc if they are known to be paired. However, if a sequence_alteration’s loci is unknown, it probably should be added directly to the GVC. :param vslc_id: :param allele1_id: :param allele2_id: :param zygosity_id: :param allele1_rel: :param allele2_rel: :return:
-
addPolypeptide
(polypeptide_id, polypeptide_label=None, transcript_id=None, polypeptide_type=None)¶ Parameters: - polypeptide_id –
- polypeptide_label –
- polypeptide_type –
- transcript_id –
Returns:
-
addReagentTargetedGene
(reagent_id, gene_id, targeted_gene_id=None, targeted_gene_label=None, description=None)¶ This will create the instance of a gene that is targeted by a molecular reagent (such as a morpholino or rnai). If an instance id is not supplied, we will create it as an anonymous individual which is of the type GENO:reagent_targeted_gene. We will also add the targets relationship between the reagent and gene class.
<targeted_gene_id> a GENO:reagent_targeted_gene rdf:label targeted_gene_label dc:description description <reagent_id> GENO:targets_instance_of <gene_id>
Parameters: - reagent_id –
- gene_id –
- targeted_gene_id –
Returns:
-
addReferenceGenome
(build_id, build_label, taxon_id)¶
-
addSequenceAlteration
(sa_id, sa_label, sa_type=None, sa_description=None)¶
-
addSequenceAlterationToVariantLocus
(sa_id, vl_id)¶
-
addSequenceDerivesFrom
(child_id, parent_id)¶
-
addTargetedGeneComplement
(tgc_id, tgc_label, tgc_type=None, tgc_description=None)¶
-
addTargetedGeneSubregion
(tgs_id, tgs_label, tgs_type=None, tgs_description=None)¶
-
addTaxon
(taxon_id, genopart_id)¶ The supplied geno part will have the specified taxon added with RO:in_taxon relation. Generally the taxon is associated with a genomic_background, but could be added to any genotype part (including a gene, regulatory element, or sequence alteration). :param taxon_id: :param genopart_id:
Returns:
-
addVSLCtoParent
(vslc_id, parent_id)¶ The VSLC can either be added to a genotype or to a GVC. The vslc is added as a part of the parent. :param vslc_id: :param parent_id: :return:
-
annotation_properties
= {'altered_nucleotide': 'GENO:altered_nucleotide', 'reference_amino_acid': 'GENO:reference_amino_acid', 'reference_nucleotide': 'GENO:reference_nucleotide', 'results_in_amino_acid_change': 'GENO:results_in_amino_acid_change'}¶
-
genoparts
= {'QTL': 'SO:0000771', 'RNAi_reagent': 'SO:0000337', 'allele': 'GENO:0000512', 'biological_region': 'SO:0001411', 'cDNA': 'SO:0000756', 'coding_transgene_feature': 'GENO:0000638', 'cytogenetic marker': 'SO:0000341', 'deletion': 'SO:0000159', 'duplication': 'SO:1000035', 'effective_genotype': 'GENO:0000525', 'extrinsic_genotype': 'GENO:0000524', 'family': 'PCO:0000020', 'female_genotype': 'GENO:0000647', 'gene': 'SO:0000704', 'genomic_background': 'GENO:0000611', 'genomic_variation_complement': 'GENO:0000009', 'heritable_phenotypic_marker': 'SO:0001500', 'insertion': 'SO:0000667', 'intrinsic_genotype': 'GENO:0000000', 'inversion': 'SO:1000036', 'karyotype_variation_complement': 'GENO:0000644', 'male_genotype': 'GENO:0000646', 'missense_variant': 'SO:0001583', 'ncRNA_gene': 'SO:0001263', 'point_mutation': 'SO:1000008', 'polypeptide': 'SO:0000104', 'population': 'PCO:0000001', 'protein_coding_gene': 'SO:0001217', 'pseudogene': 'SO:0000336', 'reagent_targeted_gene': 'GENO:0000504', 'reference_locus': 'GENO:0000036', 'regulatory_transgene_feature': 'GENO:0000637', 'sequence_alteration': 'SO:0001059', 'sequence_feature': 'SO:0000110', 'sequence_variant_affecting_polypeptide_function': 'SO:1000117', 'sequence_variant_causing_gain_of_function_of_polypeptide': 'SO:1000125', 'sequence_variant_causing_inactive_catalytic_site': 'SO:1000120', 'sequence_variant_causing_loss_of_function_of_polypeptide': 'SO:1000118', 'sex_qualified_genotype': 'GENO:0000645', 'substitution': 'SO:1000002', 'tandem_duplication': 'SO:1000173', 'targeted_gene_complement': 'GENO:0000527', 'targeted_gene_subregion': 'GENO:0000534', 'transcript': 'SO:0000233', 'transgene': 'SO:0000902', 'transgenic_insertion': 'SO:0001218', 'translocation': 'SO:0000199', 'unspecified_genomic_background': 'GENO:0000649', 'variant_locus': 'GENO:0000002', 'variant_single_locus_complement': 'GENO:0000030', 'wildtype': 'GENO:0000511'}¶
-
makeGenomeID
(taxon_id)¶
-
make_experimental_model_with_genotype
(genotype_id, genotype_label, taxon_id, taxon_label)¶
-
make_variant_locus_label
(gene_label, allele_label)¶
-
make_vslc_label
(gene_label, allele1_label, allele2_label)¶ Make a Variant Single Locus Complement (VSLC) in monarch-style. :param gene_label: :param allele1_label: :param allele2_label: :return:
-
object_properties
= {'derives_from': 'RO:0001000', 'derives_sequence_from_gene': 'GENO:0000639', 'has_affected_locus': 'GENO:0000418', 'has_alternate_part': 'GENO:0000382', 'has_gene_product': 'RO:0002205', 'has_genotype': 'GENO:0000222', 'has_member_with_allelotype': 'GENO:0000225', 'has_part': 'BFO:0000051', 'has_phenotype': 'RO:0002200', 'has_reference_part': 'GENO:0000385', 'has_sex_agnostic_genotype_part': 'GENO:0000650', 'has_variant_part': 'GENO:0000382', 'has_zygosity': 'GENO:0000608', 'in_taxon': 'RO:0002162', 'is_allelotype_of': 'GENO:0000206', 'is_mutant_of': 'GENO:0000440', 'is_reference_instance_of': 'GENO:0000610', 'is_sequence_variant_instance_of': 'GENO:0000408', 'is_targeted_expression_variant_of': 'GENO:0000443', 'is_transgene_variant_of': 'GENO:0000444', 'targeted_by': 'GENO:0000634', 'targets_instance_of': 'GENO:0000414', 'translates_to': 'RO:0002513'}¶
-
properties
= {'altered_nucleotide': 'GENO:altered_nucleotide', 'derives_from': 'RO:0001000', 'derives_sequence_from_gene': 'GENO:0000639', 'has_affected_locus': 'GENO:0000418', 'has_alternate_part': 'GENO:0000382', 'has_gene_product': 'RO:0002205', 'has_genotype': 'GENO:0000222', 'has_member_with_allelotype': 'GENO:0000225', 'has_part': 'BFO:0000051', 'has_phenotype': 'RO:0002200', 'has_reference_part': 'GENO:0000385', 'has_sex_agnostic_genotype_part': 'GENO:0000650', 'has_variant_part': 'GENO:0000382', 'has_zygosity': 'GENO:0000608', 'in_taxon': 'RO:0002162', 'is_allelotype_of': 'GENO:0000206', 'is_mutant_of': 'GENO:0000440', 'is_reference_instance_of': 'GENO:0000610', 'is_sequence_variant_instance_of': 'GENO:0000408', 'is_targeted_expression_variant_of': 'GENO:0000443', 'is_transgene_variant_of': 'GENO:0000444', 'reference_amino_acid': 'GENO:reference_amino_acid', 'reference_nucleotide': 'GENO:reference_nucleotide', 'results_in_amino_acid_change': 'GENO:results_in_amino_acid_change', 'targeted_by': 'GENO:0000634', 'targets_instance_of': 'GENO:0000414', 'translates_to': 'RO:0002513'}¶
-
zygosity
= {'complex_heterozygous': 'GENO:0000402', 'hemizygous': 'GENO:0000606', 'hemizygous-x': 'GENO:0000605', 'hemizygous-y': 'GENO:0000604', 'heteroplasmic': 'GENO:0000603', 'heterozygous': 'GENO:0000135', 'homoplasmic': 'GENO:0000602', 'homozygous': 'GENO:0000136', 'indeterminate': 'GENO:0000137', 'simple_heterozygous': 'GENO:0000458'}¶
-
-
class
dipper.models.Model.
Model
(graph)¶ Bases:
object
Utility class to add common triples to a graph (subClassOf, type, label, sameAs)
-
addBlankNodeAnnotation
(node_id)¶ Add an annotation property to the given
`node_id`
to be a pseudo blank node. This is a monarchism. :param node_id: :return:
-
addClassToGraph
(class_id, label=None, class_type=None, description=None)¶ Any node added to the graph will get at least 3 triples: *(node, type, owl:Class) and *(node, label, literal(label)) *if a type is added,
then the node will be an OWL:subclassOf that the type- *if a description is provided,
- it will also get added as a dc:description
Parameters: - class_id –
- label –
- class_type –
- description –
Returns:
-
addComment
(subject_id, comment)¶
-
addDefinition
(class_id, definition)¶
-
addDepiction
(subject_id, image_url)¶
-
addDeprecatedClass
(old_id, new_ids=None)¶ Will mark the oldid as a deprecated class. if one newid is supplied, it will mark it as replaced by. if >1 newid is supplied, it will mark it with consider properties :param old_id: str - the class id to deprecate :param new_ids: list - the class list that is
the replacement(s) of the old class. Not required.Returns: None
-
addDeprecatedIndividual
(old_id, new_ids=None)¶ Will mark the oldid as a deprecated individual. if one newid is supplied, it will mark it as replaced by. if >1 newid is supplied, it will mark it with consider properties :param g: :param oldid: the individual id to deprecate :param newids: the individual idlist that is the replacement(s) of
the old individual. Not required.Returns:
-
addDescription
(subject_id, description)¶
-
addEquivalentClass
(sub, obj)¶
-
addIndividualToGraph
(ind_id, label, ind_type=None, description=None)¶
-
addLabel
(subject_id, label)¶
-
addOWLPropertyClassRestriction
(class_id, property_id, property_value)¶
-
addOWLVersionIRI
(ontology_id, version_iri)¶
-
addOWLVersionInfo
(ontology_id, version_info)¶
-
addOntologyDeclaration
(ontology_id)¶
-
addPerson
(person_id, person_label=None)¶
-
addSameIndividual
(sub, obj)¶
-
addSubClass
(child_id, parent_id)¶
-
addSynonym
(class_id, synonym, synonym_type=None)¶ Add the synonym as a property of the class cid. Assume it is an exact synonym, unless otherwise specified :param g: :param cid: class id :param synonym: the literal synonym label :param synonym_type: the CURIE of the synonym type (not the URI) :return:
-
addTriple
(subject_id, predicate_id, obj, object_is_literal=False, literal_type=None)¶
-
addType
(subject_id, subject_type)¶
-
addXref
(class_id, xref_id, xref_as_literal=False)¶
-
annotation_properties
= {'clique_leader': 'MONARCH:cliqueLeader', 'comment': 'dc:comment', 'consider': 'OIO:consider', 'definition': 'IAO:0000115', 'depiction': 'foaf:depiction', 'description': 'dc:description', 'hasExactSynonym': 'OIO:hasExactSynonym', 'hasRelatedSynonym': 'OIO:hasRelatedSynonym', 'has_xref': 'OIO:hasDbXref', 'inchi_key': 'CHEBI:InChIKey', 'is_anonymous': 'MONARCH:anonymous', 'label': 'rdfs:label', 'replaced_by': 'IAO:0100001', 'version_info': 'owl:versionInfo'}¶
-
datatype_properties
= {'has_measurement': 'IAO:0000004', 'position': 'faldo:position'}¶
-
makeLeader
(node_id)¶ Add an annotation property to the given
`node_id`
to be the clique_leader. This is a monarchism. :param node_id: :return:
-
object_properties
= {'causally_influences': 'RO:0002566', 'causally_upstream_of_or_within': 'RO:0002418', 'contributes_to': 'RO:0002326', 'correlates_with': 'RO:0002610', 'dc:evidence': 'dc:evidence', 'dc:source': 'dc:source', 'derives_from': 'RO:0001000', 'enables': 'RO:0002327', 'ends_during': 'RO:0002093', 'ends_with': 'RO:0002230', 'equivalent_class': 'owl:equivalentClass', 'existence_ends_at': 'UBERON:existence_ends_at', 'existence_ends_during': 'RO:0002492', 'existence_starts_at': 'UBERON:existence_starts_at', 'existence_starts_during': 'RO:0002488', 'has disposition': 'RO:0000091', 'has_author': 'ERO:0000232', 'has_begin_stage_qualifier': 'GENO:0000630', 'has_end_stage_qualifier': 'GENO:0000631', 'has_environment_qualifier': 'GENO:0000580', 'has_evidence': 'RO:0002558', 'has_gene_product': 'RO:0002205', 'has_object': ':hasObject', 'has_origin': 'GENO:0000643', 'has_part': 'BFO:0000051', 'has_phenotype': 'RO:0002200', 'has_predicate': ':hasPredicate', 'has_qualifier': 'GENO:0000580', 'has_quality': 'RO:0000086', 'has_sex_specificity': ':has_sex_specificity', 'has_subject': ':hasSubject', 'in_taxon': 'RO:0002162', 'involved_in': 'RO:0002331', 'is_about': 'IAO:0000136', 'is_marker_for': 'RO:0002607', 'mentions': 'IAO:0000142', 'model_of': 'RO:0003301', 'occurs_in': 'BFO:0000066', 'on_property': 'owl:onProperty', 'part_of': 'BFO:0000050', 'same_as': 'owl:sameAs', 'some_values_from': 'owl:someValuesFrom', 'starts_during': 'RO:0002091', 'starts_with': 'RO:0002224', 'subclass_of': 'rdfs:subClassOf', 'substance_that_treats': 'RO:0002606', 'towards': 'RO:0002503', 'type': 'rdf:type', 'version_iri': 'owl:versionIRI'}¶
-
types
= {'annotation_property': 'owl:AnnotationProperty', 'class': 'owl:Class', 'datatype_property': 'owl:DatatypeProperty', 'deprecated': 'owl:deprecated', 'named_individual': 'owl:NamedIndividual', 'object_property': 'owl:ObjectProperty', 'ontology': 'owl:Ontology', 'person': 'foaf:Person', 'restriction': 'owl:Restriction'}¶
-
-
class dipper.models.Pathway.Pathway(graph)¶
Bases: object
This provides convenience methods to deal with gene and protein collections in the context of pathways.
-
addComponentToPathway
(component_id, pathway_id)¶ This can be used directly when the component is directly involved in the pathway. If a transforming event is performed on the component first, then the addGeneToPathway should be used instead.
Parameters: - pathway_id –
- component_id –
Returns:
-
addGeneToPathway
(gene_id, pathway_id)¶ When adding a gene to a pathway, we create an intermediate ‘gene product’ that is involved in the pathway, through a blank node.
gene_id RO:has_gene_product _gene_product _gene_product RO:involved_in pathway_id
Parameters: - pathway_id –
- gene_id –
Returns:
-
addPathway
(pathway_id, pathway_label, pathway_type=None, pathway_description=None)¶ Adds a pathway as a class. If no specific type is specified, it will default to a subclass of “GO:cellular_process” and “PW:pathway”. :param pathway_id: :param pathway_label: :param pathway_type: :param pathway_description: :return:
-
object_properties
= {'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'involved_in': 'RO:0002331'}¶
-
pathway_parts
= {'cellular_process': 'GO:0009987', 'gene_product': 'CHEBI:33695', 'pathway': 'PW:0000001', 'signal_transduction': 'GO:0007165'}¶
-
properties
= {'gene_product_of': 'RO:0002204', 'has_gene_product': 'RO:0002205', 'involved_in': 'RO:0002331'}¶
-
-
class dipper.models.Provenance.Provenance(graph)¶
Bases: object
To model provenance as the basis for an association. This encompasses:
- Process history leading to a claim being made, including processes through which evidence is evaluated
- Processes through which information used as evidence is created.
- Provenance metadata includes accounts of who conducted these processes,
- what entities participated in them, and when/where they occurred.
-
add_agent_to_graph
(agent_id, agent_label, agent_type=None, agent_description=None)¶
-
add_assay_to_graph
(assay_id, assay_label, assay_type=None, assay_description=None)¶
-
add_assertion
(assertion, agent, agent_label, date=None)¶ Add assertion to graph :param assertion: :param agent: :param evidence_line: :param date: :return: None
-
add_date_created
(prov_type, date)¶
-
add_study_measure
(study, measure)¶
-
add_study_parts
(study, study_parts)¶
-
add_study_to_measurements
(study, measurements)¶
-
object_properties
= {'asserted_by': 'SEPIO:0000130', 'created_at_location': 'SEPIO:0000019', 'created_by': 'SEPIO:0000018', 'created_on': 'pav:createdOn', 'created_with_resource': 'SEPIO:0000022', 'date_created': 'SEPIO:0000021', 'has_agent': 'SEPIO:0000017', 'has_input': 'RO:0002233', 'has_participant': 'RO:0000057', 'has_provenance': 'SEPIO:0000011', 'has_supporting_study': 'SEPIO:0000085', 'is_asserted_in': 'SEPIO:0000015', 'is_assertion_supported_by': 'SEPIO:0000111', 'measures': 'SEPIO:0000114', 'output_of': 'RO:0002353', 'specified_by': 'SEPIO:0000041'}¶
-
provenance_types
= {'assay': 'OBI:0000070', 'assertion': 'SEPIO:0000001', 'assertion_process': 'SEPIO:0000003', 'mixed_model': 'STATO:0000189', 'organization': 'foaf:organization', 'person': 'foaf:person', 'project': 'VIVO:Project', 'statistical_hypothesis_test': 'OBI:0000673', 'study': 'OBI:0000471', 'variant_classification_guideline': 'SEPIO:0000037', 'xref': 'OIO:hasdbxref'}¶
-
class dipper.models.Reference.Reference(graph, ref_id=None, ref_type=None)¶
Bases: object
To model references for associations (such as journal articles, books, etc.). By default, references will be typed as "documents", unless the type is set otherwise. If a short_citation is set, this will be used for the individual's label. We may wish to subclass this later.
-
addAuthor
(author)¶
-
addPage
(subject_id, page_url)¶
-
addRefToGraph
()¶
-
addTitle
(subject_id, title)¶
-
annotation_properties
= {'page': 'foaf:page', 'title': 'dc:title'}¶
-
ref_types
= {'document': 'IAO:0000310', 'journal_article': 'IAO:0000013', 'person': 'foaf:Person', 'photograph': 'IAO:0000185', 'publication': 'IAO:0000311', 'webpage': 'SIO:000302'}¶
-
setAuthorList
(author_list)¶ Parameters: author_list – Array of authors Returns:
-
setShortCitation
(citation)¶
-
setTitle
(title)¶
-
setType
(reference_type)¶
-
setYear
(year)¶
dipper.sources package¶
-
class dipper.sources.AnimalQTLdb.AnimalQTLdb(graph_type, are_bnodes_skolemized)¶
Bases: dipper.sources.Source.Source
The Animal Quantitative Trait Loci (QTL) database (Animal QTLdb) is designed to house publicly all available QTL and single-nucleotide polymorphism/gene association data on livestock animal species. This includes:
- chicken
- horse
- cow
- sheep
- rainbow trout
- pig
While most of the phenotypes here are related to animal husbandry, production, and rearing, integration of these phenotypes with other species may lead to insight for human disease.
Here, we use the QTL genetic maps and their computed genomic locations to create associations between the QTLs and their traits. The traits come in their internal Animal Trait ontology vocabulary, which they further map to [Vertebrate Trait](http://bioportal.bioontology.org/ontologies/VT), Product Trait, and Clinical Measurement Ontology vocabularies.
Since these are only associations to broad locations, we link the traits via “is_marker_for”, since there is no specific causative nature in the association. p-values for the associations are attached to the Association objects. We default to the UCSC build for the genomic coordinates, and make equivalences.
Any genetic position ranges that are <0, we do not include here.
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'cattle_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_Btau_4.6.gff.txt.gz', 'file': 'QTL_Btau_4.6.gff.txt.gz'}, 'cattle_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/cattle_QTLdata.txt', 'file': 'cattle_QTLdata.txt'}, 'chicken_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_GG_4.0.gff.txt.gz', 'file': 'QTL_GG_4.0.gff.txt.gz'}, 'chicken_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/chicken_QTLdata.txt', 'file': 'chicken_QTLdata.txt'}, 'horse_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_EquCab2.0.gff.txt.gz', 'file': 'QTL_EquCab2.0.gff.txt.gz'}, 'horse_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/horse_QTLdata.txt', 'file': 'horse_QTLdata.txt'}, 'pig_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_SS_10.2.gff.txt.gz', 'file': 'QTL_SS_10.2.gff.txt.gz'}, 'pig_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/pig_QTLdata.txt', 'file': 'pig_QTLdata.txt'}, 'rainbow_trout_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/rainbow_trout_QTLdata.txt', 'file': 'rainbow_trout_QTLdata.txt'}, 'sheep_bp': {'url': 'http://www.animalgenome.org/QTLdb/tmp/QTL_OAR_3.1.gff.txt.gz', 'file': 'QTL_OAR_3.1.gff.txt.gz'}, 'sheep_cm': {'url': 'http://www.animalgenome.org/QTLdb/export/KSUI8GFHOT6/sheep_QTLdata.txt', 'file': 'sheep_QTLdata.txt'}, 'trait_mappings': {'url': 'http://www.animalgenome.org/QTLdb/export/trait_mappings.csv', 'file': 'trait_mappings'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ Parameters: limit – Returns:
-
test_ids
= {1795, 28483, 32133, 1798, 29385, 29018, 31023, 8945, 17138, 12532, 29016, 14234}¶
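Like the other Source subclasses documented below, AnimalQTLdb is driven through the fetch-then-parse pattern. A minimal programmatic sketch follows; 'rdf_graph' as a graph_type value and a write() method inherited from Source are assumptions about the base class, not taken from this listing.
from dipper.sources.AnimalQTLdb import AnimalQTLdb

# graph_type and are_bnodes_skolemized mirror the constructor signature above
source = AnimalQTLdb('rdf_graph', True)

source.fetch()           # download the QTL map and trait-mapping files
source.parse(limit=100)  # process only the first 100 rows of each file

# Serialize whatever was parsed; fmt is assumed to be an RDFLib format name
source.write(fmt='turtle')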
-
class
dipper.sources.Bgee.
Bgee
(graph_type, are_bnodes_skolemized, tax_ids=None, version=None)¶ Bases:
dipper.sources.Source.Source
Bgee is a database to retrieve and compare gene expression patterns between animal species.
Bgee first maps heterogeneous expression data (currently RNA-Seq, Affymetrix, in situ hybridization, and EST data) to anatomy and development of different species.
Then, in order to perform automated cross-species comparisons, homology relationships across anatomies and comparison criteria between developmental stages are designed.
-
BGEE_FTP
= 'ftp.bgee.org'¶
-
DEFAULT_TAXA
= [10090, 10116, 13616, 28377, 6239, 7227, 7955, 8364, 9031, 9258, 9544, 9593, 9597, 9598, 9606, 9823, 9913]¶
-
checkIfRemoteIsNewer
(localfile, remote_size, remote_modify)¶ Overrides checkIfRemoteIsNewer in Source class
Parameters: - localfile – str file path
- remote_size – str bytes
- remote_modify – str last modify date in the form 20160705042714
Returns: boolean True if remote file is newer else False
-
fetch
(is_dl_forced=False)¶ Parameters: is_dl_forced – boolean, force download Returns:
-
files
= {'anat_entity': {'path': '/download/ranks/anat_entity/', 'pattern': re.compile('.*_all_data_.*')}}¶
-
parse
(limit=None)¶ Given the input taxa, expects files in the raw directory with the name {tax_id}_anat_entity_all_data_Pan_troglodytes.tsv.zip
Parameters: limit – int Limit to top ranked anatomy associations per group Returns: None
-
-
class
dipper.sources.BioGrid.
BioGrid
(graph_type, are_bnodes_skolemized, tax_ids=None)¶ Bases:
dipper.sources.Source.Source
Biogrid interaction data
-
biogrid_ids
= [106638, 107308, 107506, 107674, 107675, 108277, 108506, 108767, 108814, 108899, 110308, 110364, 110678, 111642, 112300, 112365, 112771, 112898, 199832, 203220, 247276, 120150, 120160, 124085]¶
-
fetch
(is_dl_forced=False)¶ Parameters: is_dl_forced – Returns: None
-
files
= {'identifiers': {'url': 'http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-IDENTIFIERS-LATEST.tab.zip', 'file': 'identifiers.tab.zip'}, 'interactions': {'url': 'http://thebiogrid.org/downloads/archives/Latest%20Release/BIOGRID-ALL-LATEST.mitab.zip', 'file': 'interactions.mitab.zip'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ Parameters: limit – Returns:
-
-
class
dipper.sources.CTD.
CTD
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
The Comparative Toxicogenomics Database (CTD) includes curated data describing cross-species chemical–gene/protein interactions and chemical– and gene–disease associations to illuminate molecular mechanisms underlying variable susceptibility and environmentally influenced diseases.
Here, we fetch, parse, and convert data from CTD into triples, leveraging only the associations based on DIRECT evidence (not using the inferred associations). We currently process the following associations: * chemical-disease * gene-pathway * gene-disease
CTD curates relationships between genes and chemicals/diseases with marker/mechanism and/or therapeutic. Unfortunately, we cannot disambiguate between marker (gene expression) and mechanism (causation) for these associations. Therefore, we are left to relate these simply by “marker”.
CTD also pulls in genes and pathway membership from KEGG and REACTOME. We create groups of these following the pattern that the specific pathway is a subclass of ‘cellular process’ (a go process), and the gene is “involved in” that process.
For diseases, we preferentially use OMIM identifiers when they can be used uniquely over MESH. Otherwise, we use MESH ids.
Note that we scrub the following identifiers and their associated data: * REACT:REACT_116125 - generic disease class * MESH:D004283 - dog diseases * MESH:D004195 - disease models, animal * MESH:D030342 - genetic diseases, inborn * MESH:D040181 - genetic diseases, x-linked * MESH:D020022 - genetic predisposition to a disease
-
fetch
(is_dl_forced=False)¶ Override Source.fetch() Fetches resources from CTD using the CTD.files dictionary Args: :param is_dl_forced (bool): Force download Returns: :return None
-
files
= {'chemical_disease_interactions': {'url': 'http://ctdbase.org/reports/CTD_chemicals_diseases.tsv.gz', 'file': 'CTD_chemicals_diseases.tsv.gz'}, 'gene_disease': {'url': 'http://ctdbase.org/reports/CTD_genes_diseases.tsv.gz', 'file': 'CTD_genes_diseases.tsv.gz'}, 'gene_pathway': {'url': 'http://ctdbase.org/reports/CTD_genes_pathways.tsv.gz', 'file': 'CTD_genes_pathways.tsv.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ Override Source.parse() Parses version and interaction information from CTD Args: :param limit (int, optional) limit the number of rows processed Returns: :return None
-
static_files
= {'publications': {'file': 'CTD_curated_references.tsv'}}¶
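Since only associations with DIRECT evidence are kept, the parser effectively discards rows without a direct-evidence value. Below is a rough pandas sketch of that filtering step; the column names follow the published CTD_chemicals_diseases.tsv.gz report layout (including a DirectEvidence column) and are an assumption here, not something defined by this module.
import pandas as pd

# CTD reports begin with '#' comment lines describing the columns
df = pd.read_csv('raw/ctd/CTD_chemicals_diseases.tsv.gz',
                 sep='\t', comment='#', header=None,
                 names=['ChemicalName', 'ChemicalID', 'CasRN', 'DiseaseName',
                        'DiseaseID', 'DirectEvidence', 'InferenceGeneSymbol',
                        'InferenceScore', 'OmimIDs', 'PubMedIDs'])

# Keep only rows curated with direct evidence (drop inferred associations)
direct = df[df['DirectEvidence'].notna()]
print(direct[['ChemicalID', 'DiseaseID', 'DirectEvidence']].head())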
-
-
class
dipper.sources.ClinVar.
ClinVar
(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)¶ Bases:
dipper.sources.Source.Source
ClinVar is a host of clinically relevant variants, both directly-submitted and curated from the literature. We process the variant_summary file here, which is a digested version of their full xml. We add all variants (and coordinates/build) from their system.
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'variant_citations': {'url': 'http://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt', 'file': 'variant_citations.txt'}, 'variant_summary': {'url': 'http://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz', 'file': 'variant_summary.txt.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
scrub
()¶ The var_citations file has a bad row in it with more than six columns; we comment these out.
Returns:
-
variant_ids
= [4288, 4289, 4290, 4291, 4297, 5240, 5241, 5242, 5243, 5244, 5245, 5246, 7105, 8877, 9295, 9296, 9297, 9298, 9449, 10072, 10361, 10382, 12528, 12529, 12530, 12531, 12532, 14353, 14823, 15872, 17232, 17233, 17234, 17235, 17236, 17237, 17238, 17239, 17284, 17285, 17286, 17287, 18179, 18180, 18181, 18343, 18363, 31951, 37123, 38562, 94060, 98004, 98005, 98006, 98008, 98009, 98194, 98195, 98196, 98197, 98198, 100055, 112885, 114372, 119244, 128714, 130558, 130559, 130560, 130561, 132146, 132147, 132148, 144375, 146588, 147536, 147814, 147936, 152976, 156327, 161457, 162000, 167132]¶
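The scrub() method above comments out malformed rows in var_citations.txt before parsing. A minimal sketch of that kind of cleanup follows, assuming a tab-delimited file whose well-formed rows have exactly six columns; the file name and column count come from the docstring, not from inspecting the data.
def comment_out_bad_rows(path, expected_cols=6):
    # Prefix rows with more than the expected number of columns with '#'
    with open(path, 'r') as infile:
        lines = infile.readlines()
    with open(path, 'w') as outfile:
        for line in lines:
            if len(line.rstrip('\n').split('\t')) > expected_cols:
                outfile.write('#' + line)
            else:
                outfile.write(line)

# comment_out_bad_rows('raw/clinvar/variant_citations.txt')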
-
-
class
dipper.sources.Coriell.
Coriell
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
The Coriell Catalog provided to Monarch includes metadata and descriptions of NIGMS, NINDS, NHGRI, and NIA cell lines. These lines are made available for research purposes. Here, we create annotations for the cell lines as models of the diseases from which they originate.
We create a handle for a patient from which the given cell line is derived (since there may be multiple cell lines created from a given patient). A genotype is assembled for a patient, which includes a karyotype (if specified) and/or a collection of variants. Both the genotype (has_genotype) and disease are linked to the patient (has_phenotype), and the cell line is listed as derived from the patient. The cell line is classified by its [CLO cell type](http://www.ontobee.org/browser/index.php?o=clo), which itself is linked to a tissue of origin.
Unfortunately, the omim numbers listed in this file are both for genes & diseases; we have no way of knowing a priori if a designated omim number is a gene or disease; so we presently link the patient to any omim id via the has_phenotype relationship.
Notice: The Coriell catalog is delivered to Monarch in a specific format, and requires ssh rsa fingerprint identification. Other groups wishing to get this data in its raw form will need to contact Coriell for credentials, which need to be placed into your configuration file for it to work.
-
fetch
(is_dl_forced=False)¶ Here we connect to the coriell sftp server using private connection details. They dump bi-weekly files with a timestamp in the filename. For each catalog, we poll the remote site and pull the most-recently updated file, renaming it to our local latest.csv.
Be sure to have the Coriell connection details in your conf.json file, like: dbauth : {"coriell" : {"user" : "<username>", "password" : "<password>", "host" : "<host>", "private_key" : "/path/to/rsa_key"}}
Parameters: is_dl_forced – Returns:
-
files
= {'NHGRI': {'label': 'NHGRI Sample Repository for Human Genetic Research', 'page': 'https://catalog.coriell.org/1/NHGRI', 'id': 'NHGRI', 'file': 'NHGRI.csv'}, 'NIA': {'label': 'NIA Aging Cell Repository', 'page': 'https://catalog.coriell.org/1/NIA', 'id': 'NIA', 'file': 'NIA.csv'}, 'NIGMS': {'label': 'NIGMS Human Genetic Cell Repository', 'page': 'https://catalog.coriell.org/1/NIGMS', 'id': 'NIGMS', 'file': 'NIGMS.csv'}, 'NINDS': {'label': 'NINDS Human Genetics DNA and Cell line Repository', 'page': 'https://catalog.coriell.org/1/NINDS', 'id': 'NINDS', 'file': 'NINDS.csv'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
terms
= {'age': 'EFO:0000246', 'cell_line_repository': 'CLO:0000008', 'collection': 'ERO:0002190', 'ethnic_group': 'EFO:0001799', 'race': 'SIO:001015', 'sampling_time': 'EFO:0000689'}¶
-
test_lines
= ['ND02380', 'ND02381', 'ND02383', 'ND02384', 'GM17897', 'GM17898', 'GM17896', 'GM17944', 'GM17945', 'ND00055', 'ND00094', 'ND00136', 'GM17940', 'GM17939', 'GM20567', 'AG02506', 'AG04407', 'AG07602AG07601', 'GM19700', 'GM19701', 'GM19702', 'GM00324', 'GM00325', 'GM00142', 'NA17944', 'AG02505', 'GM01602', 'GM02455', 'AG00364', 'GM13707', 'AG00780']¶
-
-
class
dipper.sources.Decipher.
Decipher
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
The Decipher group curates and assembles the Development Disorder Genotype Phenotype Database (DDG2P) which is a curated list of genes reported to be associated with developmental disorders, compiled by clinicians as part of the DDD study to facilitate clinical feedback of likely causal variants.
Beware that the redistribution of this data is a bit unclear from the [license](https://decipher.sanger.ac.uk/legal). If you intend to distribute this data, be sure to have the appropriate licenses in place.
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'annot': {'url': 'https://decipher.sanger.ac.uk/files/ddd/ddg2p.zip', 'file': 'ddg2p.zip'}}¶
-
make_allele_by_consequence
(consequence, gene_id, gene_symbol)¶ Given a “consequence” label that describes a variation type, create an anonymous variant of the specified gene as an instance of that consequence type.
Parameters: - consequence –
- gene_id –
- gene_symbol –
Returns: allele_id
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
-
class
dipper.sources.EOM.
EOM
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.PostgreSQLSource.PostgreSQLSource
Elements of Morphology is a resource from NHGRI that has definitions of morphological abnormalities, together with image depictions. We pull those relationships, as well as our local mapping of equivalences between EOM and HP terminologies.
The website is crawled monthly by NIF’s DISCO crawler system, which we utilize here. Be sure to have pg user/password connection details in your conf.json file, like: dbauth : {‘disco’ : {‘user’ : ‘<username>’, ‘password’ : ‘<password>’}}
Monarch-curated data for the HP to EOM mapping is stored at https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/src/mappings/hp-to-eom-mapping.tsv
Since this resource is so small, the entirety of it is the “test” set.
-
fetch
(is_dl_forced=False)¶ create the connection details for DISCO
-
files
= {'map': {'url': 'https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/src/mappings/hp-to-eom-mapping.tsv', 'file': 'hp-to-eom-mapping.tsv'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ Override Source.parse, inherited via PostgreSQLSource
-
tables
= ['dvp.pr_nlx_157874_1']¶
-
-
class
dipper.sources.Ensembl.
Ensembl
(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)¶ Bases:
dipper.sources.Source.Source
This is the processing module for Ensembl.
It only includes methods to acquire the equivalences between NCBIGene and ENSG ids using ENSEMBL’s Biomart services.
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
fetch_protein_gene_map
(taxon_id)¶ Fetch a list of proteins for a species in biomart :param taxid: :return: dict
-
fetch_protein_list
(taxon_id)¶ Fetch a list of proteins for a species in biomart :param taxid: :return: list
-
fetch_uniprot_gene_map
(taxon_id)¶ Fetch a dict of uniprot-gene for a species in biomart :param taxid: :return: dict
-
files
= {'10090': {'file': 'ensembl_10090.txt'}, '10116': {'file': 'ensembl_10116.txt'}, '13616': {'file': 'ensembl_13616.txt'}, '28377': {'file': 'ensembl_28377.txt'}, '31033': {'file': 'ensembl_31033.txt'}, '3702': {'file': 'ensembl_3702.txt'}, '44689': {'file': 'ensembl_44689.txt'}, '4896': {'file': 'ensembl_4896.txt'}, '4932': {'file': 'ensembl_4932.txt'}, '6239': {'file': 'ensembl_6239.txt'}, '7227': {'file': 'ensembl_7227.txt'}, '7955': {'file': 'ensembl_7955.txt'}, '8364': {'file': 'ensembl_8364.txt'}, '9031': {'file': 'ensembl_9031.txt'}, '9258': {'file': 'ensembl_9258.txt'}, '9544': {'file': 'ensembl_9544.txt'}, '9606': {'file': 'ensembl_9606.txt'}, '9615': {'file': 'ensembl_9615.txt'}, '9796': {'file': 'ensembl_9796.txt'}, '9823': {'file': 'ensembl_9823.txt'}, '9913': {'file': 'ensembl_9913.txt'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
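The NCBIGene-to-ENSG equivalences are retrieved through ENSEMBL's Biomart service. Below is a rough sketch of such a query against the public Biomart REST endpoint; the endpoint URL, dataset name, and attribute names are illustrative assumptions, not values taken from this module.
import requests

BIOMART = 'http://www.ensembl.org/biomart/martservice'

# A minimal Biomart XML query asking for Ensembl gene ids and their
# NCBI (EntrezGene) cross-references in the human dataset
query = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="entrezgene_id"/>
  </Dataset>
</Query>"""

response = requests.get(BIOMART, params={'query': query}, timeout=60)
for line in response.text.strip().split('\n'):
    ensg, ncbigene = (line.split('\t') + [''])[:2]
    if ncbigene:
        print('ENSEMBL:' + ensg, 'owl:equivalentClass', 'NCBIGene:' + ncbigene)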
-
-
class
dipper.sources.FlyBase.
FlyBase
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.PostgreSQLSource.PostgreSQLSource
This is the [Drosophila Genetics](http://www.flybase.org/) resource, from which we process genotype and phenotype data about fruitfly. Genotypes leverage the GENO genotype model.
Here, we connect to their public database, and download a subset of tables/views to get specifically at the geno-pheno data, then iterate over the tables. We end up effectively performing joins when adding nodes to the graph. We connect using the [Direct Chado Access](http://gmod.org/wiki/Public_Chado_Databases#Direct_Chado_Access)
When running the whole set, it performs best by dumping raw triples using the flag `--format nt` (see the example command after this class entry).
-
fetch
(is_dl_forced=False)¶ Returns:
-
files
= {'disease_models': {'url': 'ftp://ftp.flybase.net/releases/current/precomputed_files/human_disease/allele_human_disease_model_data_fb_*.tsv.gz', 'file': 'allele_human_disease_model_data.tsv.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ We process each of the postgres tables in turn. The order of processing is important here, as we build up a hashmap of internal vs external identifiers (unique keys by type to FB id). These include allele, marker (gene), publication, strain, genotype, annotation (association), and descriptive notes. :param limit: Only parse this many lines of each table :return:
-
querys
= {'feature': "\n SELECT feature_id, dbxref_id, organism_id, name, uniquename,\n null as residues, seqlen, md5checksum, type_id, is_analysis,\n timeaccessioned, timelastmodified\n FROM feature WHERE is_analysis = false and is_obsolete = 'f'\n ", 'feature_dbxref_WIP': ' -- 17M rows in ~2 minutes\n SELECT\n feature.name feature_name, feature.uniquename feature_id,\n organism.abbreviation abbrev, organism.genus, organism.species,\n cvterm.name frature_type, db.name db, dbxref.accession\n FROM feature_dbxref\n JOIN dbxref ON feature_dbxref.dbxref_id = dbxref.dbxref_id\n JOIN db ON dbxref.db_id = db.db_id\n JOIN feature ON feature_dbxref.feature_id = feature.feature_id\n JOIN organism ON feature.organism_id = organism.organism_id\n JOIN cvterm ON feature.type_id = cvterm.cvterm_id\n WHERE feature_dbxref.is_current = true\n AND feature.is_analysis = false\n AND feature.is_obsolete = false\n AND cvterm.is_obsolete = 0\n ;\n '}¶
-
resources
= [{'outfile': 'feature_relationship', 'query': '../../resources/sql/fb/feature_relationship.sql'}, {'outfile': 'stockprop', 'query': '../../resources/sql/fb/stockprop.sql'}]¶
-
tables
= ['genotype', 'feature_genotype', 'pub', 'feature_pub', 'pub_dbxref', 'feature_dbxref', 'cvterm', 'stock_genotype', 'stock', 'organism', 'organism_dbxref', 'environment', 'phenotype', 'phenstatement', 'dbxref', 'phenotype_cvterm', 'phendesc', 'environment_cvterm']¶
-
test_keys
= {'allele': [29677937, 23174110, 23230960, 23123654, 23124718, 23146222, 29677936, 23174703, 11384915, 11397966, 53333044, 23189969, 3206803, 29677937, 29677934, 23256689, 23213050, 23230614, 23274987, 53323093, 40362726, 11380755, 11380754, 23121027, 44425218, 28298666], 'annot': [437783, 437784, 437785, 437786, 437789, 437796, 459885, 436779, 436780, 479826], 'feature': [11411407, 53361578, 53323094, 40377849, 40362727, 11379415, 61115970, 11380753, 44425219, 44426878, 44425220], 'gene': [23220066, 10344219, 58107328, 3132660, 23193483, 3118401, 3128715, 3128888, 23232298, 23294450, 3128626, 23255338, 8350351, 41994592, 3128715, 3128432, 3128840, 3128650, 3128654, 3128602, 3165464, 23235262, 3165510, 3153563, 23225695, 54564652, 3111381, 3111324], 'genotype': [267393, 267400, 130147, 168516, 111147, 200899, 46696, 328131, 328132, 328134, 328136, 381024, 267411, 327436, 197293, 373125, 361163, 403038], 'notes': [], 'organism': [1, 226, 456], 'pub': [359867, 327373, 153054, 153620, 370777, 154315, 345909, 365672, 366057, 11380753], 'strain': [8117, 3649, 64034, 213, 30131]}¶
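To run this source end to end while dumping raw n-triples as suggested in the class description, an invocation along these lines should work (assuming the source key is flybase and the same dipper-etl.py entry point used for the other sources):
dipper-etl.py --sources flybase --format nt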
-
-
class
dipper.sources.GWASCatalog.
GWASCatalog
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
The NHGRI-EBI Catalog of published genome-wide association studies.
We link the variants recorded here to the curated EFO-classes using a “contributes_to” linkage because the only thing we know is that the SNPs are associated with the trait/disease, but we don’t know if it is actually causative.
Description of the GWAS catalog is here: http://www.ebi.ac.uk/gwas/docs/fileheaders#_file_headers_for_catalog_version_1_0_1
The GWAS Catalog also publishes OWL files, described here: http://www.ebi.ac.uk/gwas/docs/ontology
Status: IN PROGRESS
-
GWASFILE
= 'gwas-catalog-associations_ontology-annotated.tsv'¶
-
GWASFTP
= 'ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/latest'¶
-
fetch
(is_dl_forced=False)¶ Parameters: is_dl_forced – Returns:
-
files
= {'catalog': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/latest/gwas-catalog-associations_ontology-annotated.tsv', 'file': 'gwas-catalog-associations_ontology-annotated.tsv'}, 'efo': {'url': 'http://www.ebi.ac.uk/efo/efo.owl', 'file': 'efo.owl'}, 'so': {'url': 'http://purl.obolibrary.org/obo/so.owl', 'file': 'so.owl'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_catalog
(limit=None)¶ Parameters: limit – Returns:
-
terms
= {'age': 'EFO:0000246', 'cell_line_repository': 'CLO:0000008', 'collection': 'ERO:0002190', 'ethnic_group': 'EFO:0001799', 'race': 'SIO:001015', 'sampling_time': 'EFO:0000689'}¶
-
-
class
dipper.sources.GeneOntology.
GeneOntology
(graph_type, are_bnodes_skolemized, tax_ids=None)¶ Bases:
dipper.sources.Source.Source
This is the parser for the [Gene Ontology Annotations](http://www.geneontology.org), from which we process gene-process/function/subcellular location associations.
We generate the GO graph to include the following information: * genes * gene-process * gene-function * gene-location
We process only a subset of the organisms (those taxa listed in the files attribute below).
Status: IN PROGRESS / INCOMPLETE
-
clean_db_prefix
(db)¶ Here, we map the GO-style prefixes with Monarch-style prefixes that are able to be processed by our curie_map. :param db: :return:
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'10090': {'url': 'http://geneontology.org/gene-associations/gene_association.mgi.gz', 'file': 'gene_association.mgi.gz'}, '10116': {'url': 'http://geneontology.org/gene-associations/gene_association.rgd.gz', 'file': 'gene_association.rgd.gz'}, '4896': {'url': 'http://geneontology.org/gene-associations/gene_association.pombase.gz', 'file': 'gene_association.pombase.gz'}, '559292': {'url': 'http://geneontology.org/gene-associations/gene_association.sgd.gz', 'file': 'gene_association.sgd.gz'}, '6239': {'url': 'http://geneontology.org/gene-associations/gene_association.wb.gz', 'file': 'gene_association.wb.gz'}, '7227': {'url': 'http://geneontology.org/gene-associations/gene_association.fb.gz', 'file': 'gene_association.fb.gz'}, '7955': {'url': 'http://geneontology.org/gene-associations/gene_association.zfin.gz', 'file': 'gene_association.zfin.gz'}, '9031': {'url': 'http://geneontology.org/gene-associations/goa_chicken.gaf.gz', 'file': 'gene_association.goa_ref_chicken.gz'}, '9606': {'url': 'http://geneontology.org/gene-associations/goa_human.gaf.gz', 'file': 'gene_association.goa_ref_human.gz'}, '9615': {'url': 'http://geneontology.org/gene-associations/goa_dog.gaf.gz', 'file': 'gene_association.goa_dog.gz'}, '9823': {'url': 'http://geneontology.org/gene-associations/goa_pig.gaf.gz', 'file': 'gene_association.goa_ref_pig.gz'}, '9913': {'url': 'http://geneontology.org/gene-associations/goa_cow.gaf.gz', 'file': 'goa_cow.gaf.gz'}, 'go-references': {'url': 'http://www.geneontology.org/doc/GO.references', 'file': 'GO.references'}, 'id-map': {'url': 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/idmapping_selected.tab.gz', 'file': 'idmapping_selected.tab.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
get_uniprot_entrez_id_map
()¶
-
map_files
= {'eco_map': 'http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt'}¶
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_gaf
(file, limit, id_map=None, eco_map=None)¶
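process_gaf() above consumes the GO annotation (GAF) files listed in files. The following is a minimal sketch of reading one of those files and pulling out the columns this parser cares about; the 17-column GAF 2.x layout is a property of the GO file format rather than of this module, and the file path is illustrative.
import gzip

# GAF 2.x columns (0-based): 0 DB, 1 DB Object ID, 4 GO ID,
# 6 Evidence Code, 8 Aspect (P/F/C), 12 Taxon
ASPECTS = {'P': 'biological process', 'F': 'molecular function',
           'C': 'cellular component'}

with gzip.open('raw/go/gene_association.zfin.gz', 'rt') as gaf:
    for line in gaf:
        if line.startswith('!'):        # header/comment lines
            continue
        cols = line.rstrip('\n').split('\t')
        gene = cols[0] + ':' + cols[1]  # e.g. ZFIN:ZDB-GENE-...
        go_id, evidence, aspect = cols[4], cols[6], cols[8]
        print(gene, ASPECTS.get(aspect, aspect), go_id, evidence)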
-
-
class
dipper.sources.GeneReviews.
GeneReviews
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Here we process the GeneReviews mappings to OMIM, plus inspect the GeneReviews (html) books to pull the clinical descriptions in order to populate the definitions of the terms in the ontology. We define the GeneReviews items as classes that are either grouping classes over OMIM disease ids (gene ids are filtered out), or are made as subclasses of DOID:4 (generic disease).
Note that GeneReviews [copyright policy](http://www.ncbi.nlm.nih.gov/books/NBK138602/) (as of 2015.11.20) says:
GeneReviews® chapters are owned by the University of Washington, Seattle, © 1993-2015. Permission is hereby granted to reproduce, distribute, and translate copies of content materials provided that (i) credit for source (www.ncbi.nlm.nih.gov/books/NBK1116/) and copyright (University of Washington, Seattle) are included with each copy; (ii) a link to the original material is provided whenever the material is published elsewhere on the Web; and (iii) reproducers, distributors, and/or translators comply with this copyright notice and the GeneReviews Usage Disclaimer.
This script doesn’t pull the GeneReviews books from the NCBI Bookshelf directly; scripting this task is expressly prohibited by [NCBIBookshelf policy](http://www.ncbi.nlm.nih.gov/books/NBK45311/). However, assuming you have acquired the books (in html format) via permissible means, a parser for those books is provided here to extract the clinical descriptions to define the NBK identified classes.
-
create_books
()¶
-
fetch
(is_dl_forced=False)¶ We fetch GeneReviews id-label map and id-omim mapping files from NCBI. :return: None
-
files
= {'idmap': {'url': 'http://ftp.ncbi.nih.gov/pub/GeneReviews/NBKid_shortname_OMIM.txt', 'file': 'NBKid_shortname_OMIM.txt'}, 'titles': {'url': 'http://ftp.ncbi.nih.gov/pub/GeneReviews/GRtitle_shortname_NBKid.txt', 'file': 'GRtitle_shortname_NBKid.txt'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ Returns: None
-
process_nbk_html
(limit)¶ Here we process the gene reviews books to fetch the clinical descriptions to include in the ontology. We only use books that have been acquired manually, as NCBI Bookshelf does not permit automated downloads. This parser will only process the books that are found in the
`raw/genereviews/books`
directory, permitting partial completion. Parameters: limit – Returns:
-
-
class
dipper.sources.HGNC.
HGNC
(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)¶ Bases:
dipper.sources.Source.Source
This is the processing module for HGNC.
We create equivalences between HGNC identifiers and ENSEMBL and NCBIGene. We also add the links to cytogenic locations for the gene features.
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'genes': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt', 'file': 'hgnc_complete_set.txt'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
get_symbol_id_map
()¶ A convenience method to create a mapping between the HGNC symbols and their identifiers. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
-
class
dipper.sources.HPOAnnotations.
HPOAnnotations
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
The [Human Phenotype Ontology](http://human-phenotype-ontology.org) group curates and assembles over 115,000 annotations to hereditary diseases using the HPO ontology. Here we create OBAN-style associations between diseases and phenotypic features, together with their evidence, and age of onset and frequency (if known). The parser currently only processes the “abnormal” annotations. Association to “remarkable normality” will be added in the near future.
We create additional associations from text mining. See info at http://pubmed-browser.human-phenotype-ontology.org/.
Also, you can read about these annotations in [PMID:26119816](http://www.ncbi.nlm.nih.gov/pubmed/26119816).
In order to properly test this class, you should have a conf.json file configured with some test ids, in a structure like (put your favorite ids in the config): test_ids: {"disease" : ["OMIM:119600", "OMIM:120160"]}
-
add_common_files_to_file_list
()¶
-
eco_dict
= {'ICE': 'ECO:0000305', 'IEA': 'ECO:0000501', 'ITM': 'ECO:0000246', 'PCS': 'ECO:0000269', 'TAS': 'ECO:0000304'}¶
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'annot': {'url': 'http://compbio.charite.de/hudson/job/hpo.annotations/lastStableBuild/artifact/misc/phenotype_annotation.tab', 'file': 'phenotype_annotation.tab'}, 'doid': {'url': 'http://purl.obolibrary.org/obo/doid.owl', 'file': 'doid.owl'}, 'version': {'url': 'http://compbio.charite.de/hudson/job/hpo.annotations/lastStableBuild/artifact/misc/data_version.txt', 'file': 'data_version.txt'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
get_common_files
()¶ Fetch the raw hpo-annotation-data by cloning/pulling the [repository](https://github.com/monarch-initiative/hpo-annotation-data.git) These files get added to the files object, and iterated over separately. :return:
-
get_doid_ids_for_unpadding
()¶ Here, we fetch the doid owl file, and get all the doids. We figure out which are not zero-padded, so we can map the DOID to the correct identifier when processing the common annotation files.
This may become obsolete when https://github.com/monarch-initiative/hpo-annotation-data/issues/84 is addressed.
Returns:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_all_common_disease_files
(limit=None)¶ Loop through all of the files that we previously fetched from git, creating the disease-phenotype assoc. :param limit: :return:
-
process_common_disease_file
(raw, unpadded_doids, limit=None)¶ Make disease-phenotype associations. Some identifiers need cleanup: * DOIDs listed as DOID-DOID: are remapped to DOID: * unnecessarily zero-padded DOIDs are remapped to their non-padded equivalents.
Parameters: - raw –
- unpadded_doids –
- limit –
Returns:
-
scrub
()¶ Perform various data-scrubbing on the raw data files prior to parsing. For this resource, this currently includes: * revise errors in identifiers for some OMIM and PMIDs
Returns: None
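process_common_disease_file() above remaps unnecessarily zero-padded DOIDs using the identifiers gathered by get_doid_ids_for_unpadding(). A hypothetical sketch of that remapping follows; the helper name and the exact decision rule are assumptions, not the module's implementation.
def unpad_doid(doid, unpadded_doids):
    # e.g. 'DOID:0014667' -> 'DOID:14667', but only when the unpadded
    # form is a known DOID; otherwise leave the identifier untouched
    prefix, local = doid.split(':', 1)
    unpadded = prefix + ':' + local.lstrip('0')
    return unpadded if unpadded in unpadded_doids else doid

# unpad_doid('DOID:0014667', {'DOID:14667'}) -> 'DOID:14667'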
-
-
class
dipper.sources.IMPC.
IMPC
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
From the [IMPC](http://mousephenotype.org) website: The IMPC is generating a knockout mouse strain for every protein coding gene by using the embryonic stem cell resource generated by the International Knockout Mouse Consortium (IKMC). Systematic broad-based phenotyping is performed by each IMPC center using standardized procedures found within the International Mouse Phenotyping Resource of Standardised Screens (IMPReSS) resource. Gene-to-phenotype associations are made by a versioned statistical analysis with all data freely available by this web portal and by several data download features.
Here, we pull the data and model the genotypes using GENO and the genotype-to-phenotype associations using the OBAN schema.
We use all identifiers given by the IMPC, with a few exceptions:
- For identifiers that IMPC provides but does not resolve, we instantiate them as blank nodes. Examples include things with the pattern of: UROALL, EUROCURATE, NULL-*.
- We mint three identifiers:
1. Intrinsic genotypes (not including sex), based on:
- colony_id (ES cell line + phenotyping center)
- strain
- zygosity
2. Effective genotypes that are attached to the phenotypes, based on:
- colony_id (ES cell line + phenotyping center)
- strain
- zygosity
- sex
3. Associations, based on: effective_genotype_id + phenotype_id + phenotyping_center + pipeline_stable_id + procedure_stable_id + parameter_stable_id
We DO NOT yet add the assays as evidence for the G2P associations here. To be added in the future.
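One way to mint deterministic identifiers from the field combinations listed above is to hash their concatenation, so that the same combination always yields the same blank-node-style identifier. This is only an illustrative sketch, not the scheme IMPC or this module actually uses.
import hashlib

def mint_effective_genotype_id(colony_id, strain, zygosity, sex):
    # Concatenate the distinguishing fields and hash them
    key = '-'.join([colony_id, strain, zygosity, sex])
    return '_:' + hashlib.md5(key.encode('utf-8')).hexdigest()

# mint_effective_genotype_id('colony-123', 'C57BL/6N', 'homozygote', 'female')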
-
compare_checksums
()¶ test to see if fetched file matches checksum from ebi :return: True or False
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'all': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/impc/latest/csv/ALL_genotype_phenotype.csv.gz', 'file': 'ALL_genotype_phenotype.csv.gz'}, 'checksum': {'url': 'ftp://ftp.ebi.ac.uk/pub/databases/impc/latest/csv/checksum.md5', 'file': 'checksum.md5'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
map_files
= {'impc_map': '../../resources/impc_mappings.yaml', 'impress_map': 'https://data.monarchinitiative.org/dipper/cache/impress_codes.json'}¶
-
parse
(limit=None)¶ IMPC data is delivered in three separate csv files OR in one integrated file, each with the same file format.
Parameters: limit – Returns:
-
parse_checksum_file
(file)¶ :param file :return dict
-
test_ids
= ['MGI:109380', 'MGI:1347004', 'MGI:1353495', 'MGI:1913840', 'MGI:2144157', 'MGI:2182928', 'MGI:88456', 'MGI:96704', 'MGI:1913649', 'MGI:95639', 'MGI:1341847', 'MGI:104848', 'MGI:2442444', 'MGI:2444584', 'MGI:1916948', 'MGI:107403', 'MGI:1860086', 'MGI:1919305', 'MGI:2384936', 'MGI:88135', 'MGI:1913367', 'MGI:1916571', 'MGI:2152453', 'MGI:1098270']¶
-
class
dipper.sources.KEGG.
KEGG
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'cel_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/cel', 'file': 'cel_orthologs'}, 'disease': {'url': 'http://rest.genome.jp/list/disease', 'file': 'disease'}, 'disease_gene': {'url': 'http://rest.kegg.jp/link/disease/hsa', 'file': 'disease_gene'}, 'dme_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/dme', 'file': 'dme_orthologs'}, 'dre_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/dre', 'file': 'dre_orthologs'}, 'hsa_gene2pathway': {'url': 'http://rest.kegg.jp/link/pathway/hsa', 'file': 'human_gene2pathway'}, 'hsa_genes': {'url': 'http://rest.genome.jp/list/hsa', 'file': 'hsa_genes'}, 'hsa_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/hsa', 'file': 'hsa_orthologs'}, 'mmu_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/mmu', 'file': 'mmu_orthologs'}, 'ncbi': {'url': 'http://rest.genome.jp/conv/ncbi-geneid/hsa', 'file': 'ncbi'}, 'omim2disease': {'url': 'http://rest.genome.jp/link/disease/omim', 'file': 'omim2disease'}, 'omim2gene': {'url': 'http://rest.genome.jp/link/omim/hsa', 'file': 'omim2gene'}, 'ortholog_classes': {'url': 'http://rest.genome.jp/list/orthology', 'file': 'ortholog_classes'}, 'pathway': {'url': 'http://rest.genome.jp/list/pathway', 'file': 'pathway'}, 'pathway_disease': {'url': 'http://rest.kegg.jp/link/pathway/ds', 'file': 'pathway_disease'}, 'pathway_ko': {'url': 'http://rest.kegg.jp/link/pathway/ko', 'file': 'pathway_ko'}, 'pathway_pubmed': {'url': 'http://rest.kegg.jp/link/pathway/pubmed', 'file': 'pathway_pubmed'}, 'rno_orthologs': {'url': 'http://rest.kegg.jp/link/orthology/rno', 'file': 'rno_orthologs'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ Parameters: limit – Returns:
-
test_ids
= {'disease': ['ds:H00015', 'ds:H00026', 'ds:H00712', 'ds:H00736', 'ds:H00014'], 'genes': ['hsa:100506275', 'hsa:285958', 'hsa:286410', 'hsa:6387', 'hsa:1080', 'hsa:11200', 'hsa:1131', 'hsa:1137', 'hsa:126', 'hsa:1277', 'hsa:1278', 'hsa:1285', 'hsa:1548', 'hsa:1636', 'hsa:1639', 'hsa:183', 'hsa:185', 'hsa:1910', 'hsa:207', 'hsa:2099', 'hsa:2483', 'hsa:2539', 'hsa:2629', 'hsa:2697', 'hsa:3161', 'hsa:3845', 'hsa:4137', 'hsa:4591', 'hsa:472', 'hsa:4744', 'hsa:4835', 'hsa:4929', 'hsa:5002', 'hsa:5080', 'hsa:5245', 'hsa:5290', 'hsa:53630', 'hsa:5630', 'hsa:5663', 'hsa:580', 'hsa:5888', 'hsa:5972', 'hsa:6311', 'hsa:64327', 'hsa:6531', 'hsa:6647', 'hsa:672', 'hsa:675', 'hsa:6908', 'hsa:7040', 'hsa:7045', 'hsa:7048', 'hsa:7157', 'hsa:7251', 'hsa:7490', 'hsa:7517', 'hsa:79728', 'hsa:83893', 'hsa:83990', 'hsa:841', 'hsa:8438', 'hsa:8493', 'hsa:860', 'hsa:9568', 'hsa:9627', 'hsa:9821', 'hsa:999', 'hsa:3460'], 'orthology_classes': ['ko:K00010', 'ko:K00027', 'ko:K00042', 'ko:K00088'], 'pathway': ['path:map00010', 'path:map00195', 'path:map00100', 'path:map00340', 'path:hsa05223']}¶
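Each entry in files above is a two-column, tab-separated dump from the KEGG REST API. Below is a short sketch of fetching and indexing one of those link files; the URL comes from the files dictionary, while the parsing itself is a generic illustration.
import requests

# Pairs of human genes and the diseases they are linked to,
# one pair per line, e.g. 'hsa:7157\tds:H00015'
resp = requests.get('http://rest.kegg.jp/link/disease/hsa', timeout=60)

gene_to_diseases = {}
for line in resp.text.strip().split('\n'):
    gene, disease = line.split('\t')
    gene_to_diseases.setdefault(gene, []).append(disease)

print(len(gene_to_diseases), 'genes with at least one disease link')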
-
-
class
dipper.sources.MGI.
MGI
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.PostgreSQLSource.PostgreSQLSource
This is the [Mouse Genome Informatics](http://www.informatics.jax.org/) resource, from which we process genotype and phenotype data about laboratory mice. Genotypes leverage the GENO genotype model.
Here, we connect to their public database, and download a subset of tables/views to get specifically at the geno-pheno data, then iterate over the tables. We end up effectively performing joins when adding nodes to the graph. In order to use this parser, you will need to have user/password connection details in your conf.json file, like: dbauth : {‘mgi’ : {‘user’ : ‘<username>’, ‘password’ : ‘<password>’}} You can request access by contacting mgi-help@jax.org
-
fetch
(is_dl_forced=False)¶ For the MGI resource, we connect to the remote database, and pull the tables into local files. We’ll check the local table versions against the remote version :return:
-
fetch_transgene_genes_from_db
(cxn)¶ This is a custom query to fetch the non-mouse genes that are part of transgene alleles.
Parameters: cxn – Returns:
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ We process each of the postgres tables in turn. The order of processing is important here, as we build up a hashmap of internal vs external identifiers (unique keys by type to MGI id). These include allele, marker (gene), publication, strain, genotype, annotation (association), and descriptive notes. :param limit: Only parse this many lines of each table :return:
-
process_mgi_note_allele_view
(limit=None)¶ These are the descriptive notes about the alleles. Note that these notes have embedded HTML - should we do anything about that? :param limit: :return:
-
process_mgi_relationship_transgene_genes
(limit=None)¶ Here, we have the relationship between MGI transgene alleles, and the non-mouse gene ids that are part of them. We augment the allele with the transgene parts.
Parameters: limit – Returns:
-
resources
= [{'Force': True, 'outfile': 'mgi_dbinfo', 'query': '../../resources/sql/mgi/mgi_dbinfo.sql'}, {'outfile': 'gxd_genotype_view', 'query': '../../resources/sql/mgi/gxd_genotype_view.sql'}, {'outfile': 'gxd_genotype_summary_view', 'query': '../../resources/sql/mgi/gxd_genotype_summary_view.sql'}, {'outfile': 'gxd_allelepair_view', 'query': '../../resources/sql/mgi/gxd_allelepair_view.sql'}, {'outfile': 'all_summary_view', 'query': '../../resources/sql/mgi/all_summary_view.sql'}, {'outfile': 'all_allele_view', 'query': '../../resources/sql/mgi/all_allele_view.sql'}, {'outfile': 'all_allele_mutation_view', 'query': '../../resources/sql/mgi/all_allele_mutation_view.sql'}, {'outfile': 'mrk_marker_view', 'query': '../../resources/sql/mgi/mrk_marker_view.sql'}, {'outfile': 'voc_annot_view', 'query': '../../resources/sql/mgi/voc_annot_view.sql'}, {'outfile': 'evidence_view', 'query': '../../resources/sql/mgi/evidence.sql'}, {'outfile': 'bib_acc_view', 'query': '../../resources/sql/mgi/bib_acc_view.sql'}, {'outfile': 'prb_strain_view', 'query': '../../resources/sql/mgi/prb_strain_view.sql'}, {'outfile': 'mrk_summary_view', 'query': '../../resources/sql/mgi/mrk_summary_view.sql'}, {'outfile': 'mrk_acc_view', 'query': '../../resources/sql/mgi/mrk_acc_view.sql'}, {'outfile': 'prb_strain_acc_view', 'query': '../../resources/sql/mgi/prb_strain_acc_view.sql'}, {'outfile': 'prb_strain_genotype_view', 'query': '../../resources/sql/mgi/prb_strain_genotype_view.sql'}, {'outfile': 'mgi_note_vocevidence_view', 'query': '../../resources/sql/mgi/mgi_note_vocevidence_view.sql'}, {'outfile': 'mgi_note_allele_view', 'query': '../../resources/sql/mgi/mgi_note_allele_view.sql'}, {'outfile': 'mrk_location_cache', 'query': '../../resources/sql/mgi/mrk_location_cache.sql'}]¶
-
test_keys
= {'allele': [1612, 1609, 1303, 56760, 816699, 51074, 14595, 816707, 246, 38139, 4334, 817387, 8567, 476, 42885, 3658, 1193, 6978, 6598, 16698, 626329, 33649, 835532, 7861, 33649, 6308, 1285, 827608], 'annot': [6778, 12035, 189442, 189443, 189444, 189445, 189446, 189447, 189448, 189449, 189450, 189451, 189452, 318424, 717023, 717024, 717025, 717026, 717027, 717028, 717029, 5123647, 928426, 5647502, 6173775, 6173778, 6173780, 6173781, 6620086, 13487622, 13487623, 13487624, 23241933, 23534428, 23535949, 23546035, 24722398, 29645663, 29645664, 29645665, 29645666, 29645667, 29645682, 43803707, 43804057, 43805682, 43815003, 43838073, 58485679, 59357863, 59357864, 59357865, 59357866, 59357867, 60448185, 60448186, 60448187, 62628962, 69611011, 69611253, 79642481, 79655585, 80436328, 83942519, 84201418, 90942381, 90942382, 90942384, 90942385, 90942386, 90942389, 90942390, 90942391, 90942392, 92947717, 92947729, 92947735, 92947757, 92948169, 92948441, 92948518, 92949200, 92949301, 93092368, 93092369, 93092370, 93092371, 93092372, 93092373, 93092374, 93092375, 93092376, 93092377, 93092378, 93092379, 93092380, 93092381, 93092382, 93401080, 93419639, 93436973, 93436974, 93436975, 93436976, 93436977, 93459094, 93459095, 93459096, 93459097, 93484431, 93484432, 93491333, 93491334, 93491335, 93491336, 93491337, 93510296, 93510297, 93510298, 93510299, 93510300, 93548463, 93551440, 93552054, 93576058, 93579091, 93579870, 93581813, 93581832, 93581841, 93581890, 93583073, 93583786, 93584586, 93587213, 93604448, 93607816, 93613038, 93614265, 93618579, 93620355, 93621390, 93624755, 93626409, 93626918, 93636629, 93642680, 93643814, 93643825, 93647695, 93648755, 93652704, 5123647, 71668107, 71668108, 71668109, 71668110, 71668111, 71668112, 71668113, 71668114, 74136778, 107386012, 58485691], 'genotype': [81, 87, 142, 206, 281, 283, 286, 287, 341, 350, 384, 406, 407, 411, 425, 457, 458, 461, 476, 485, 537, 546, 551, 553, 11702, 12910, 13407, 13453, 14815, 26655, 28610, 37313, 38345, 59766, 60082, 65406, 64235], 'marker': [357, 38043, 305574, 444020, 34578, 9503, 38712, 17679, 445717, 38415, 12944, 377, 77197, 18436, 30157, 14252, 412465, 38598, 185833, 35408, 118781, 37270, 31169, 25040, 81079], 'notes': [5114, 53310, 53311, 53312, 53313, 53314, 53315, 53316, 53317, 53318, 53319, 53320, 71099, 501751, 501752, 501753, 501754, 501755, 501756, 501757, 744108, 1055341, 6049949, 6621213, 6621216, 6621218, 6621219, 7108498, 14590363, 14590364, 14590365, 25123358, 25123360, 26688159, 32028545, 32028546, 32028547, 32028548, 32028549, 32028564, 37833486, 47742903, 47743253, 47744878, 47754199, 47777269, 65105483, 66144014, 66144015, 66144016, 66144017, 66144018, 70046116, 78382808, 78383050, 103920312, 103920318, 103920319, 103920320, 103920322, 103920323, 103920324, 103920325, 103920326, 103920328, 103920330, 103920331, 103920332, 103920333, 106390006, 106390018, 106390024, 106390046, 106390458, 106390730, 106390807, 106391489, 106391590, 106579450, 106579451, 106579452, 106579453, 106579454, 106579455, 106579456, 106579457, 106579458, 106579459, 106579460, 106579461, 106579462, 106579463, 106579464, 106949909, 106949910, 106969368, 106969369, 106996040, 106996041, 106996042, 106996043, 106996044, 107022123, 107022124, 107022125, 107022126, 107052057, 107052058, 107058959, 107058960, 107058961, 107058962, 107058963, 107077922, 107077923, 107077924, 107077925, 107077926, 107116089, 107119066, 107119680, 107154485, 107155254, 107158128, 107159385, 107160435, 107163154, 107163183, 107163196, 107163271, 107164877, 107165872, 
107166942, 107168838, 107170557, 107174867, 107194346, 107198590, 107205179, 107206725, 107212120, 107214364, 107214911, 107215700, 107218519, 107218642, 107219974, 107221415, 107222064, 107222717, 107235068, 107237686, 107242709, 107244121, 107244139, 107248964, 107249091, 107250401, 107251870, 107255383, 107256603], 'pub': [73197, 165659, 134151, 76922, 181903, 26681, 128938, 80054, 156949, 159965, 53672, 170462, 206876, 87798, 100777, 176693, 139205, 73199, 74017, 102010, 152095, 18062, 216614, 61933, 13385, 32366, 114625, 182408, 140802], 'strain': [30639, 33832, 33875, 33940, 36012, 59504, 34338, 34382, 47670, 59802, 33946, 31421, 64, 40, 14, -2, 30639, 15975, 35077, 12610, -1, 28319, 27026, 141, 62299]}¶
-
-
class
dipper.sources.MGISlim.
MGISlim
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Slim MGI model containing only gene-to-phenotype associations. Uses MouseMine: http://www.mousemine.org/mousemine/begin.do
-
fetch
(is_dl_forced)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
-
class
dipper.sources.MMRRC.
MMRRC
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Here we process the Mutant Mouse Resource and Research Center (https://www.mmrrc.org) strain data, which includes: * strains, their mutant alleles * phenotypes of the alleles * descriptions of the research uses of the strains
Note that some gene identifiers are not included (for many of the transgenics with human genes) in the raw data. We do our best to process the links between the variant and the affected gene, but sometimes the mapping is not clear, and we do not include it. Many of these details will be solved by merging this source with the MGI data source, which has the variant-to-gene designations.
Also note that even though the strain pages at the MMRRC site do list phenotypic differences in the context of the strain backgrounds, they do not provide that data to us, and thus we cannot supply that disambiguation here.
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'catalog': {'url': 'https://www.mmrrc.org/about/mmrrc_catalog_data.csv', 'file': 'mmrrc_catalog_data.csv'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
test_ids
= ['MMRRC:037507-MU', 'MMRRC:041175-UCD', 'MMRRC:036933-UNC', 'MMRRC:037884-UCD', 'MMRRC:000255-MU', 'MMRRC:037372-UCD', 'MMRRC:000001-UNC']¶
-
-
class
dipper.sources.MPD.
MPD
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
From the [MPD](http://phenome.jax.org/) website: This resource is a collaborative standardized collection of measured data on laboratory mouse strains and populations. Includes baseline phenotype data sets as well as studies of drug, diet, disease and aging effect. Also includes protocols, projects and publications, and SNP, variation and gene expression studies.
Here, we pull the data and model the genotypes using GENO and the genotype-to-phenotype associations using the OBAN schema.
MPD provides measurements for particular assays for several strains. Each of these measurements is itself mapped to an MP or VT term as a phenotype. Therefore, we can create a strain-to-phenotype association based on those strains that lie outside of the “normal” range for the given measurements. We compute the average of the measurements across all strains tested, and then flag as extreme any measurement that falls beyond a threshold from that average.
Our default threshold is +/- 2 standard deviations from the mean.
Because the measurements are made and recorded at the level of a specific sex of each strain, we associate the MP/VT phenotype with the sex-qualified genotype/strain.
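A small numeric sketch of the thresholding described above, using the default of +/- 2 standard deviations; the per-strain values are made up for illustration and do not come from strainmeans.csv.
import statistics

# Hypothetical per-strain means for one sex-qualified measurement
strain_means = {
    'strainA': 12.1, 'strainB': 11.8, 'strainC': 12.4,
    'strainD': 19.7,  # an outlier
    'strainE': 12.0, 'strainF': 11.5,
}

mean = statistics.mean(strain_means.values())
stdev = statistics.stdev(strain_means.values())
threshold = 2  # +/- 2 standard deviations, the default described above

for strain, value in strain_means.items():
    zscore = (value - mean) / stdev
    if abs(zscore) > threshold:
        print(strain, 'is outside the normal range (z =', round(zscore, 2), ')')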
-
MPDDL
= 'http://phenomedoc.jax.org/MPD_downloads'¶
-
static
build_measurement_description
(row)¶
-
static
check_header
(filename, header)¶
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'assay_metadata': {'url': 'http://phenomedoc.jax.org/MPD_downloads/measurements.csv', 'file': 'measurements.csv'}, 'ontology_mappings': {'url': 'http://phenomedoc.jax.org/MPD_downloads/ontology_mappings.csv', 'file': 'ontology_mappings.csv'}, 'straininfo': {'url': 'http://phenomedoc.jax.org/MPD_downloads/straininfo.csv', 'file': 'straininfo.csv'}, 'strainmeans': {'url': 'http://phenomedoc.jax.org/MPD_downloads/strainmeans.csv.gz', 'file': 'strainmeans.csv.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
mgd_agent_id
= 'MPD:db/q?rtn=people/allinv'¶
-
mgd_agent_label
= 'Mouse Phenotype Database'¶
-
mgd_agent_type
= 'foaf:organization'¶
-
static
normalise_units
(units)¶
-
parse
(limit=None)¶ MPD data is delivered in four separate csv files and one xml file, which we process iteratively and write out as one large graph.
Parameters: limit – Returns:
-
test_ids
= ['MPD:6', 'MPD:849', 'MPD:425', 'MPD:569', 'MPD:10', 'MPD:1002', 'MPD:39', 'MPD:2319']¶
-
-
class
dipper.sources.Monarch.
Monarch
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
This is the parser for data curated by the [Monarch Initiative](https://monarchinitiative.org). Data is currently maintained in a private repository, soon to be released.
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_omia_phenotypes
(limit)¶
-
-
class
dipper.sources.Monochrom.
Monochrom
(graph_type, are_bnodes_skolemized, tax_ids=None)¶ Bases:
dipper.sources.Source.Source
This class will leverage the GENO ontology and modeling patterns to build an ontology of chromosomes for any species. These classes represent major structural pieces of Chromosomes which are often universally referenced, using physical properties/observations that remain constant over different genome builds (such as banding patterns and arms). The idea is to create a scaffold upon which we can hang build-specific chromosomal coordinates, and reason across them.
In general, this will take the cytogenetic bands files from UCSC, and create missing grouping classes, in order to build the partonomy from a very specific chromosomal band up through the chromosome itself and enable overlap and containment queries. We use RO:subsequence_of as our relationship between nested chromosomal parts. For example, 13q21.31 ==> 13q21.31, 13q21.3, 13q21, 13q2, 13q, 13
At the moment, this only computes the bands for Human, Mouse, Zebrafish, and Rat but will be expanding in the future as needed.
Because this is a universal framework to represent the chromosomal structure of any species, we must mint identifiers for each chromosome and part. We differentiate species by first creating a species-specific genome, then for each species-specific chromosome we include the NCBI taxon number together with the chromosome number, like:
`<species number>chr<num><band>`
. For 13q21.31, this would be 9606chr13q21.31. We then create triples for a given band like: CHR:9606chr1p36.33 rdf[type] SO:chromosome_band, and CHR:9606chr1p36 subsequence_of CHR:9606chr1p36.3. Any band in the file is an instance of a chr_band (or a more specific type), and is a subsequence of its containing region. We determine the containing regions of a band by parsing the band string; since each alphanumeric character is a significant “place”, we can split it, with the shorter strings being parents of the longer string.
Since this is small, and we have not limited other items in our test set to a small region, we simply use the whole graph (genome) for testing purposes, and copy the main graph to the test graph.
Since this Dipper class is building an ONTOLOGY, rather than instance-level data, we must also include domain and range constraints, and other owl-isms.
TODO: any species by commandline argument
We are currently mapping these to the CHR idspace, but this is NOT YET APPROVED and is subject to change.
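A rough sketch of the band-splitting described above, deriving the containing regions of a band from its notation; make_parent_bands() below performs this role in the module, but this illustration is not its exact algorithm and only handles plain p/q band strings.
import re

def parent_regions(band):
    # '13q21.31' -> ['13', '13q', '13q2', '13q21', '13q21.3']
    match = re.match(r'^([0-9]+|[XY])([pq].*)?$', band)
    if match is None:
        return []
    chrom, sub = match.group(1), match.group(2) or ''
    parents = [chrom]
    for i in range(1, len(sub)):
        prefix = sub[:i]
        if prefix.endswith('.'):
            continue  # '13q21.' is not a region on its own
        parents.append(chrom + prefix)
    return parents
Species-specific identifiers such as CHR:9606chr13q21.31 would have the taxon prefix stripped before applying this kind of split.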
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'10090': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/cytoBandIdeo.txt.gz', 'build_num': 'mm10', 'genome_label': 'Mouse', 'file': '10090cytoBand.txt.gz'}, '10116': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/rn6/database/cytoBandIdeo.txt.gz', 'build_num': 'rn6', 'genome_label': 'Rat', 'file': '10116cytoBand.txt.gz'}, '7955': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/danRer10/database/cytoBandIdeo.txt.gz', 'build_num': 'danRer10', 'genome_label': 'Zebrafish', 'file': '7955cytoBand.txt.gz'}, '9031': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/galGal4/database/cytoBandIdeo.txt.gz', 'build_num': 'galGal4', 'genome_label': 'chicken', 'file': 'galGal4cytoBand.txt.gz'}, '9606': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz', 'build_num': 'hg19', 'genome_label': 'Human', 'file': '9606cytoBand.txt.gz'}, '9796': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/equCab2/database/cytoBandIdeo.txt.gz', 'build_num': 'equCab2', 'genome_label': 'horse', 'file': 'equCab2cytoBand.txt.gz'}, '9823': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/cytoBandIdeo.txt.gz', 'build_num': 'susScr3', 'genome_label': 'pig', 'file': 'susScr3cytoBand.txt.gz'}, '9913': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/cytoBandIdeo.txt.gz', 'build_num': 'bosTau7', 'genome_label': 'cow', 'file': 'bosTau7cytoBand.txt.gz'}, '9940': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/oviAri3/database/cytoBandIdeo.txt.gz', 'build_num': 'oviAri3', 'genome_label': 'sheep', 'file': 'oviAri3cytoBand.txt.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
make_parent_bands
(band, child_bands)¶ this will determine the grouping bands that it belongs to, recursively 13q21.31 ==> 13, 13q, 13q2, 13q21, 13q21.3, 13q21.31
Parameters: - band –
- child_bands –
Returns:
-
map_type_of_region
(regiontype)¶ Note that “stalk” refers to the short arm of acrocentric chromosomes chr13,14,15,21,22 for human. :param regiontype: :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
region_type_map
= {'acen': 'SO:0000577', 'chromosome': 'SO:0000340', 'chromosome_arm': 'SO:0000105', 'chromosome_band': 'SO:0000341', 'chromosome_part': 'SO:0000830', 'gneg': 'SO:0000341', 'gpos100': 'SO:0000341', 'gpos25': 'SO:0000341', 'gpos33': 'SO:0000341', 'gpos50': 'SO:0000341', 'gpos66': 'SO:0000341', 'gpos75': 'SO:0000341', 'gvar': 'SO:0000341', 'stalk': 'SO:0000341'}¶
-
-
dipper.sources.Monochrom.
getChrPartTypeByNotation
(notation)¶ This method will figure out the kind of feature that a given band is based on pattern matching to standard karyotype notation. (e.g. 13q22.2 ==> chromosome sub-band)
This has been validated against human, mouse, fish, and rat nomenclature. :param notation: the band (without the chromosome prefix) :return:
-
class
dipper.sources.MyChem.
MyChem
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
-
static
add_relation
(results, relation)¶
-
static
check_uniprot
(target_dict)¶
-
static
chunks
(l, n)¶ Yield successive n-sized chunks from l.
-
static
execute_query
(query)¶
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
fetch_from_mychem
()¶
-
static
format_actions
(target_dict)¶
-
static
get_drug_record
(ids, fields)¶
-
static
get_inchikeys
()¶
-
make_triples
(source, package)¶
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
static
return_target_list
(targ_in)¶
-
-
class
dipper.sources.MyDrug.
MyDrug
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Drugs and Compounds stored in the BioThings database
-
MY_DRUG_API
= 'http://c.biothings.io/v1/query'¶
-
checkIfRemoteIsNewer
(localfile)¶ Need to figure out how biothings records releases, for now if the file exists we will assume it is a fully downloaded cache :param localfile: str file path :return: boolean True if remote file is newer else False
-
fetch
(is_dl_forced=False)¶ Note there is an unpublished mydrug client that works like this:
from mydrug import MyDrugInfo
md = MyDrugInfo()
r = list(md.query('_exists_:aeolus', fetch_all=True))
Parameters: is_dl_forced – boolean, force download Returns:
-
files
= {'aeolus': {'file': 'aeolus.json'}}¶
-
parse
(limit=None, or_limit=1)¶ Parse mydrug files :param limit: int limit json docs processed :param or_limit: int odds ratio limit :return: None
-
-
class
dipper.sources.NCBIGene.
NCBIGene
(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)¶ Bases:
dipper.sources.Source.Source
This is the processing module for the National Center for Biotechnology Information. It includes parsers for the gene_info (gene names, symbols, ids, equivalent ids), gene history (alt ids), and gene2pubmed publication references about a gene.
This creates Genes as classes, when they are properly typed as such. For those entries of 'unknown significance', they are simply added as instances of a sequence feature. It will add equivalentClasses for a subset of external identifiers, including: ENSEMBL, HGMD, MGI, ZFIN, and gene product links for HPRD. They are additionally located in their chromosomal band (until we process actual genomic coords in a separate file).
We process the genes from the filtered taxa, starting with those configured by default (human, mouse, fish). This can be overridden in the calling script to include additional taxa, if desired. The gene ids in the conf.json will be used to subset the data when testing.
All entries in the gene_history file are added as deprecated classes, and linked to the current gene id, with “replaced_by” relationships.
Since we do not know much about the specific link in the gene2pubmed file, we simply create a "mentions" relationship.
-
SCIGRAPH_BASE
= 'https://scigraph-ontology-dev.monarchinitiative.org/scigraph/graph/'¶
-
add_orthologs_by_gene_group
(graph, gene_ids)¶ This will get orthologies between human and other vertebrate genomes based on the gene_group annotation pipeline from NCBI. More information can be learned here: http://www.ncbi.nlm.nih.gov/news/03-13-2014-gene-provides-orthologs-regions/ The method for associations is described in [PMCID:3882889](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3882889/) == [PMID:24063302](http://www.ncbi.nlm.nih.gov/pubmed/24063302/). Because these are only between human and vertebrate genomes, they will certainly miss out on very distant orthologies, and should not be considered complete.
We do not run this within the NCBI parser itself; rather it is a convenience function for others parsers to call.
Parameters: - graph –
- gene_ids – Gene ids to fetch the orthology
Returns:
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'gene2pubmed': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz', 'file': 'gene2pubmed.gz'}, 'gene_group': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_group.gz', 'file': 'gene_group.gz'}, 'gene_history': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_history.gz', 'file': 'gene_history.gz'}, 'gene_info': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz', 'file': 'gene_info.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
static
map_type_of_gene
(sotype)¶
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
resources
= {'clique_leader': '../../resources/clique_leader.yaml'}¶
-
-
class
dipper.sources.OMA.
OMA
(graph_type, are_bnodes_skolemized, tax_ids=None)¶ Bases:
dipper.sources.OrthoXML.OrthoXML
-
BENCHMARK_BASE
= 'https://omabrowser.org/ReferenceProteomes'¶
-
files
= {'oma_hogs': {'url': 'https://omabrowser.org/ReferenceProteomes/OMA_GETHOGs-2_2017-04.orthoxml.gz', 'file': 'OMA_GETHOGs-2_2017-04.orthoxml.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
-
class
dipper.sources.OMIA.
OMIA
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
This is the parser for the [Online Mendelian Inheritance in Animals (OMIA)](http://omia.angis.org.au), from which we process inherited disorders, other (single-locus) traits, and genes in >200 animal species (other than human and mouse and rats).
We generate the omia graph to include the following information:
- genes
- animal taxonomy, and breeds as instances of those taxa (breeds are akin to "strains" in other taxa)
- animal diseases, along with species-specific subtypes of those diseases
- publications (and their mapping to PMIDs, if available)
- gene-to-phenotype associations (via an anonymous variant-locus)
- breed-to-phenotype associations
We make links between OMIA and OMIM in two ways:
1. mappings between OMIA and OMIM are created as OMIA –> hasdbXref OMIM
2. mappings between a breed and an OMIA disease are created to be a model for the mapped OMIM disease, IF AND ONLY IF it is a 1:1 mapping. There are some 1:many mappings, and these often happen if the OMIM item is a gene.
Because many of these species are not covered in the PANTHER orthology datafiles, we also pull any orthology relationships from the gene_group files from NCBI.
-
clean_up_omim_genes
()¶
-
fetch
(is_dl_forced=False)¶ Parameters: is_dl_forced – Returns:
-
files
= {'data': {'url': 'http://compldb.angis.org.au/dumps/omia.xml.gz', 'file': 'omia.xml.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
make_breed_id
(key)¶
-
map_omia_group_category_to_ontology_id
(category_num)¶ Using the category number in the OMIA_groups table, map them to a disease id. This may be superseded by other MONDO methods.
Platelet disorders will be more specific once https://github.com/obophenotype/human-disease-ontology/issues/46 is fulfilled.
Parameters: category_num – Returns:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_associations
(limit)¶ Loop through the xml file and process the article-breed, article-phene, breed-phene, phene-gene associations, and the external links to LIDA.
Parameters: limit – Returns:
-
process_classes
(limit)¶ Loop through the xml file and process the articles, breed, genes, phenes, and phenotype-grouping classes. We add elements to the graph, and store the id-to-label in the label_hash dict, along with the internal key-to-external id in the id_hash dict. The latter are referenced in the association processing functions.
Parameters: limit – Returns:
-
process_species
(limit)¶ Loop through the xml file and process the species. We add elements to the graph, and store the id-to-label in the label_hash dict. :param limit: :return:
-
scrub
()¶ The XML file seems to have mixed-encoding; we scrub out the control characters from the file for processing.
i.e. omia.xml:1555328.28: PCDATA invalid Char value 2 <field name="journal">Bulletin et Memoires de la Societe Centrale de Medic
Returns:
-
write_molgen_report
()¶
-
class
dipper.sources.OMIM.
OMIM
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
The only anonymously obtainable data from the ftp site is mim2gene. However, more detailed information is available via their API. So, we pull the omim identifiers from their ftp site, then query their API in batches of 20. Their prescribed rate limits have been mercurial:
one per two seconds or four per second; in November 2017 all mention of API rate limits had vanished (save 20 IDs per call if any include is used). Note this ingest requires an API key, which is not stored in the repo, but in a separate conf.json file.
Processing this source serves two purposes: 1. the creation of the OMIM classes for merging into the disease ontology 2. add annotations such as disease-gene associations
When creating the disease classes, we pull from their REST-api id/label/definition information. Additionally we pull the Orphanet and UMLS mappings (to make equivalent ids). We also pull the phenotypic series annotations as grouping classes.
-
fetch
(is_dl_forced=True)¶ Get the preconfigured static files. This DOES NOT fetch the individual records via REST…that is handled in the parsing function. (To be refactored.) over riding Source.fetch() calling Source.get_files() :param is_dl_forced: :return:
-
files
= {'all': {'url': 'https://omim.org/static/omim/data/mim2gene.txt', 'clean': 'https://data.omim.org/downloads/', 'file': 'mim2gene.txt'}, 'morbidmap': {'url': 'https://data.omim.org/downloads//morbidmap.txt', 'clean': 'https://data.omim.org/downloads/', 'file': 'morbidmap.txt'}, 'phenotypicSeries': {'url': 'https://omim.org/phenotypicSeriesTitle/all?format=tsv', 'headers': {'User-Agent': 'The Monarch Initiative (https://monarchinitiative.org/; info@monarchinitiative.org)'}, 'clean': 'https://data.omim.org/downloads/', 'file': 'phenotypic_series_title_all.txt'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_entries
(omimids, transform, included_fields=None, graph=None, limit=None)¶ Given a list of omim ids, this will use the omim API to fetch the entries, according to the `included_fields` passed as a parameter. If a transformation function is supplied, this will iterate over each entry, and either add the results to the supplied `graph` or will return a set of processed entries that the calling function can further iterate. If no `included_fields` are provided, this will simply fetch the basic entry from omim, which includes an entry's: prefix, mimNumber, status, and titles.
Parameters: - omimids – the set of omim entry ids to fetch using their API
- transform – Function to transform each omim entry when looping
- included_fields – A set of what fields are required to retrieve from the API
- graph – the graph to add the transformed data into
Returns:
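A hedged usage sketch follows; the graph_type value and the id list are illustrative, and this source needs an OMIM API key in conf.json before any entries can actually be fetched. It uses the module-level filter_keep_phenotype_entry_ids documented below as the transform and collects the returned results instead of writing into a graph.
from dipper.sources.OMIM import OMIM, filter_keep_phenotype_entry_ids

omim = OMIM('rdf_graph', True)          # graph_type, are_bnodes_skolemized

# With graph=None and included_fields=None, only the basic entry data
# (prefix, mimNumber, status, titles) is fetched, and the transformed
# results are returned for the caller to iterate.
kept = omim.process_entries(
    ['100070', '100678'],               # illustrative OMIM ids
    filter_keep_phenotype_entry_ids,
    included_fields=None, graph=None, limit=None)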
-
test_ids
= [119600, 120160, 157140, 158900, 166220, 168600, 219700, 253250, 305900, 600669, 601278, 602421, 605073, 607822, 102560, 102480, 100678, 102750, 600201, 104200, 105400, 114480, 115300, 121900, 107670, 11600, 126453, 102150, 104000, 107200, 100070, 611742, 611100, 102480]¶
-
-
dipper.sources.OMIM.
filter_keep_phenotype_entry_ids
(entry, graph=None)¶
-
dipper.sources.OMIM.
get_omim_id_from_entry
(entry)¶
-
class
dipper.sources.Orphanet.
Orphanet
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Orphanet’s aim is to help improve the diagnosis, care and treatment of patients with rare diseases. For Orphanet, we are currently only parsing the disease-gene associations.
Note that ???
-
fetch
(is_dl_forced=False)¶ Parameters: is_dl_forced – Returns:
-
files
= {'disease-gene': {'url': 'http://www.orphadata.org/data/xml/en_product6.xml', 'file': 'en_product6.xml'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
-
class
dipper.sources.OrthoXML.
OrthoXML
(graph_type, are_bnodes_skolemized, method, tax_ids=None)¶ Bases:
dipper.sources.Source.Source
Extract the induced pairwise relations from an OrthoXML file.
This base class is primarily intended to extract the orthologous and paralogous relations from an OrthoXML file containing the QfO reference species data set.
A concrete source should subclass this class and override the constructor to provide the information about the dataset and a method name.
-
add_protein_to_graph
¶ Adds protein nodes to the graph and adds an "in_taxon" triple.
For efficiency reasons, we cache which proteins we have already added using a least-recently-used cache.
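A simplified illustration of that caching idea, assuming the Model API shown elsewhere in this documentation; the CURIEs, the SO type (polypeptide) and the RO relation used for in_taxon are assumptions, not the exact terms this source emits.
from functools import lru_cache
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model

graph = RDFGraph()
model = Model(graph)

@lru_cache(maxsize=50000)
def add_protein_once(protein_id, taxon_id):
    # Runs only the first time this (protein, taxon) pair is seen;
    # later calls are answered from the cache and add nothing to the graph.
    model.addType(protein_id, 'SO:0000104')              # polypeptide (assumed type)
    model.addTriple(protein_id, 'RO:0002162', taxon_id)  # in_taxon (assumed relation)

add_protein_once('ENSEMBL:ENSP00000354587', 'NCBITaxon:9606')
add_protein_once('ENSEMBL:ENSP00000354587', 'NCBITaxon:9606')  # cached, no-op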
-
clean_protein_id
(protein_id)¶ makes sure protein_id is properly prefixed
-
extract_taxon_info
(gene_node)¶ extract the ncbi taxon id from a gene_node
default implementation goes up to the species node in the xml and extracts the id from the attribute at that node.
-
fetch
(is_dl_forced=False)¶ Returns: None
-
files
= {}¶
-
parse
(limit=None)¶ Returns: None
-
-
class
dipper.sources.Panther.
Panther
(graph_type, are_bnodes_skolemized, tax_ids=None)¶ Bases:
dipper.sources.Source.Source
The pairwise orthology calls from Panther DB: http://pantherdb.org/ encompass 22 species, from the RefGenome and HCOP projects. Here, we map the orthology classes to RO homology relationships. This resource may be extended in the future with additional species.
This currently makes a graph of orthologous relationships between genes, with the assumption that gene metadata (labels, equivalent ids) are provided from other sources.
Gene families are nominally created from the orthology files, though these are incomplete with no hierarchical (subfamily) information. This will get updated from the HMM files in the future.
Note that there is a fair amount of identifier cleanup performed to align with our standard CURIE prefixes.
The test graph of data is output based on configured “protein” identifiers in conf.json.
By default, this will produce a file with ALL orthologous relationships. IF YOU WANT ONLY A SUBSET, YOU NEED TO PROVIDE A FILTER UPON CALLING THIS WITH THE TAXON IDS
-
PNTHDL
= 'ftp://ftp.pantherdb.org/ortholog/current_release'¶
-
fetch
(is_dl_forced=False)¶ Returns: None
-
files
= {'hcop': {'url': 'ftp://ftp.pantherdb.org/ortholog/current_release/Orthologs_HCOP.tar.gz', 'file': 'Orthologs_HCOP.tar.gz'}, 'refgenome': {'url': 'ftp://ftp.pantherdb.org/ortholog/current_release/RefGenomeOrthologs.tar.gz', 'file': 'RefGenomeOrthologs.tar.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ Returns: None
-
-
class
dipper.sources.PostgreSQLSource.
PostgreSQLSource
(graph_type, are_bnodes_skolemized, name=None)¶ Bases:
dipper.sources.Source.Source
Class for interfacing with remote Postgres databases
-
fetch_from_pgdb
(tables, cxn, limit=None, force=False)¶ Will fetch all Postgres tables from the database specified in the cxn connection parameters. This will save them to a local file named the same as the table, in tab-delimited format, including a header.
Parameters: - tables – Names of tables to fetch
- cxn – database connection details
- limit – A max row count to fetch for each table
Returns: None
-
fetch_query_from_pgdb
(qname, query, con, cxn, limit=None, force=False)¶ Supply either an already established connection, or connection parameters. The supplied connection will override any separate cxn parameter :param qname: The name of the query to save the output to :param query: The SQL query itself :param con: The already-established connection :param cxn: The postgres connection information :param limit: If you only want a subset of rows from the query :return:
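A hedged usage sketch: the graph_type, source name, table names, query and, in particular, the keys expected in the cxn connection-parameter dict are assumptions for illustration only.
from dipper.sources.PostgreSQLSource import PostgreSQLSource

src = PostgreSQLSource('rdf_graph', True, name='mgi')   # values are illustrative

# Connection parameters; the exact keys dipper expects are an assumption here.
cxn = {'host': 'localhost', 'database': 'mgd', 'port': 5432,
       'user': 'reader', 'password': 'secret'}

# Dump whole tables to local tab-delimited files (one file per table, with header).
src.fetch_from_pgdb(['voc_term', 'mgi_dbinfo'], cxn, limit=None, force=False)

# Save the output of an ad hoc query under the name 'my_query', passing cxn
# connection parameters instead of an already-open connection (con=None).
src.fetch_query_from_pgdb('my_query', 'SELECT * FROM voc_term', None, cxn, limit=100)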
-
-
class
dipper.sources.RGD.
RGD
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Ingest of Rat Genome Database gene to mammalian phenotype gaf file
-
RGD_BASE
= 'ftp://ftp.rgd.mcw.edu/pub/data_release/annotated_rgd_objects_by_ontology/'¶
-
fetch
(is_dl_forced=False)¶ Override Source.fetch(). Fetches resources from the Rat Genome Database ftp site. :param is_dl_forced: (bool) force download :return: None
-
files
= {'rat_gene2mammalian_phenotype': {'url': 'ftp://ftp.rgd.mcw.edu/pub/data_release/annotated_rgd_objects_by_ontology/rattus_genes_mp', 'file': 'rattus_genes_mp'}}¶
-
make_association
(record)¶ construct the association :param record: :return: modeled association of genotype to mammalian phenotype
-
parse
(limit=None)¶ Override Source.parse(). :param limit: (int, optional) limit the number of rows processed :return: None
-
-
class
dipper.sources.Reactome.
Reactome
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Reactome is a free, open-source, curated and peer reviewed pathway database. (http://reactome.org/)
-
REACTOME_BASE
= 'http://www.reactome.org/download/current/'¶
-
fetch
(is_dl_forced=False)¶ Override Source.fetch(). Fetches resources from Reactome using the Reactome.files dictionary. :param is_dl_forced: (bool) force download :return: None
-
files
= {'chebi2pathway': {'url': 'http://www.reactome.org/download/current/ChEBI2Reactome.txt', 'file': 'ChEBI2Reactome.txt'}, 'ensembl2pathway': {'url': 'http://www.reactome.org/download/current/Ensembl2Reactome.txt', 'file': 'Ensembl2Reactome.txt'}}¶
-
map_files
= {'eco_map': 'http://purl.obolibrary.org/obo/eco/gaf-eco-mapping.txt'}¶
-
parse
(limit=None)¶ Override Source.parse(). :param limit: (int, optional) limit the number of rows processed :return: None
-
-
class
dipper.sources.SGD.
SGD
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
Ingest of Saccharomyces Genome Database (SGD) phenotype associations
-
SGD_BASE
= 'https://downloads.yeastgenome.org/curation/literature/'¶
-
fetch
(is_dl_forced=False)¶ Override Source.fetch(). Fetches resources from the Saccharomyces Genome Database (SGD) download site. :param is_dl_forced: (bool) force download :return: None
-
files
= {'sgd_phenotype': {'url': 'https://downloads.yeastgenome.org/curation/literature/phenotype_data.tab', 'file': 'phenotype_data.tab'}}¶
-
static
make_apo_map
()¶
-
make_association
(record)¶ construct the association :param record: :return: modeled association of genotype to mammalian phenotype
-
parse
(limit=None)¶ Override Source.parse(). :param limit: (int, optional) limit the number of rows processed :return: None
-
-
class
dipper.sources.Source.
Source
(graph_type, are_bnodes_skized=False, name=None)¶ Bases:
object
Abstract class for any data sources that we’ll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph will then be written out to a single self.name().<dest_fmt> file.
-
checkIfRemoteIsNewer
(remote, local, headers)¶ Given a remote file location, and the corresponding local file this will check the datetime stamp on the files to see if the remote one is newer. This is a convenience method to be used so that we don’t have to re-fetch files that we already have saved locally :param remote: URL of file to fetch from remote server :param local: pathname to save file to locally :return: True if the remote file is newer and should be downloaded
-
compare_local_remote_bytes
(remotefile, localfile, remote_headers=None)¶ test to see if fetched file is the same size as the remote file using information in the content-length field in the HTTP header :return: True or False
-
declareAsOntology
(graph)¶ The file we output needs to be declared as an ontology, including its version information.
TEC: I am not convinced dipper reformatting external data as RDF triples makes an OWL ontology (nor that it should be considered a goal).
Proper ontologies are built by ontologists. Dipper reformats data and annotates/decorates it with a minimal set of carefully arranged terms drawn from multiple proper ontologies, which allows the whole (dipper's RDF triples and parent ontologies) to function as a single ontology we can reason over when combined in a store such as SciGraph.
Including more than the minimal ontological terms in dipper’s RDF output constitutes a liability as it allows greater divergence between dipper artifacts and the proper ontologies.
Further information will be augmented in the dataset object. :param version: :return:
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
fetch_from_url
(remotefile, localfile=None, is_dl_forced=False, headers=None)¶ Given a remote url and a local filename, attempt to determine if the remote file is newer; if it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded :param remotefile: URL of remote file to fetch :param localfile: pathname of file to save locally :return: None
-
file_len
(fname)¶
-
files
= {}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
static
get_eco_map
(url)¶ To convert the three-column file to a hashmap we join primary and secondary keys, for example
IEA GO_REF:0000002 ECO:0000256
IEA GO_REF:0000003 ECO:0000501
IEA Default ECO:0000501
becomes
IEA-GO_REF:0000002: ECO:0000256
IEA-GO_REF:0000003: ECO:0000501
IEA: ECO:0000501
Returns: dict
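A minimal sketch of that key-joining, assuming tab-separated rows and building the map from already-downloaded lines rather than the url the real method takes:
def build_eco_map(lines):
    """Join primary and secondary keys as described in the docstring."""
    eco_map = {}
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue
        code, go_ref, eco_id = line.strip().split('\t')
        if go_ref == 'Default':
            eco_map[code] = eco_id                 # e.g. 'IEA': 'ECO:0000501'
        else:
            eco_map[code + '-' + go_ref] = eco_id  # e.g. 'IEA-GO_REF:0000002': 'ECO:0000256'
    return eco_map

rows = ['IEA\tGO_REF:0000002\tECO:0000256',
        'IEA\tGO_REF:0000003\tECO:0000501',
        'IEA\tDefault\tECO:0000501']
print(build_eco_map(rows))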
-
get_file_md5
(directory, file, blocksize=1048576)¶
-
get_files
(is_dl_forced, files=None)¶ Given a set of files for this source, it will go fetch them, and set a default version by date. If you need to set the version number by another method, then it can be set again. :param is_dl_forced - boolean :param files dict - override instance files dict :return: None
-
get_local_file_size
(localfile)¶ Parameters: localfile – Returns: size of file
-
get_remote_content_len
(remote, headers=None)¶ Parameters: remote – Returns: size of remote file
-
static
hash_id
(long_string)¶ Return the sha1 hash of the string, truncated to a 64-bit sized word and prepended with 'b' to avoid leading with a digit. :param long_string: str string to be hashed :return: str hash of id
-
static
make_id
(long_string, prefix='MONARCH')¶ A method to create DETERMINISTIC identifiers based on a string's digest; currently implemented with sha1. :param long_string: :return:
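A sketch along the lines the two digest docstrings describe (sha1, truncated to a 64-bit word, with a non-numeric leading character); the exact separator and truncation length in dipper are assumptions here.
import hashlib

def make_id(long_string, prefix='MONARCH'):
    digest = hashlib.sha1(long_string.encode('utf-8')).hexdigest()
    return prefix + ':b' + digest[:16]   # 'b' + 16 hex chars (64 bits)

print(make_id('OMIM:154700HP:0000007'))
# identical input always produces the identical identifier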
-
namespaces
= {}¶
-
static
open_and_parse_yaml
(file)¶ Parameters: file – String, path to file containing label-id mappings in the first two columns of each row Returns: dict where keys are labels and values are ids
-
parse
(limit)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
static
parse_mapping_file
(file)¶ Parameters: file – String, path to file containing label-id mappings in the first two columns of each row Returns: dict where keys are labels and values are ids
-
process_xml_table
(elem, table_name, processing_function, limit)¶ This is a convenience function to process the elements of an xml document, when the xml is used as an alternative way of distributing sql-like tables. In this case, the "elem" is akin to an sql table, with its name of `table_name`. It will then process each `row` given the `processing_function` supplied.
Parameters: - elem – The element data
- table_name – The name of the table to process
- processing_function – The row processing function
- limit –
Appears to be making calls to the elementTree library, although it is not explicitly imported here.
Returns:
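A hedged usage sketch, streaming an XML dump with ElementTree and handing each table element to this method; the local path, the 'name' attribute, the table name, and the row handler's signature are all assumptions for illustration.
import xml.etree.ElementTree as ET
from dipper.sources.OMIA import OMIA

omia = OMIA('rdf_graph', True)               # constructor args as documented above

def handle_row(row, *args):                  # hypothetical row-processing callback
    print(row.attrib)

tree = ET.parse('raw/omia/omia.xml')         # illustrative local path
for elem in tree.getroot():
    # assume each child element carries the table name in a 'name' attribute
    if elem.get('name') == 'Articles':
        omia.process_xml_table(elem, 'Articles', handle_row, limit=10)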
-
static
remove_backslash_r
(filename, encoding)¶ A helpful utility to remove Carriage Return from any file. This will read a file into memory, and overwrite the contents of the original file.
TODO: This function may be a liability
Parameters: filename – Returns:
-
settestmode
(mode)¶ Set testMode to (mode). - True: run the Source in testMode; - False: run it in full mode :param mode: :return: None
-
settestonly
(testonly)¶ Set that this source should only be processed in testMode :param testOnly: :return: None
-
whoami
()¶
-
write
(fmt='turtle', stream=None)¶ This convenience method will write out all of the graphs associated with the source. Right now these are hardcoded to be a single "graph" and a "src_dataset.ttl" and a "src_test.ttl". If you do not supply stream='stdout', it will default to writing these to files.
In addition, if the version number isn’t yet set in the dataset, it will be set to the date on file. :return: None
-
-
class
dipper.sources.StringDB.
StringDB
(graph_type, are_bnodes_skolemized, tax_ids=None, version=None)¶ Bases:
dipper.sources.Source.Source
STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases. From: http://string-db.org/cgi/about.pl?footer_active_subpage=content
STRING uses one protein per gene. If there is more than one isoform per gene, we usually select the longest isoform, unless we have information that suggests that another isoform is regarded as canonical (e.g., proteins in the CCDS database). From: http://string-db.org/cgi/help.pl
-
DEFAULT_TAXA
= [9606, 10090, 7955, 7227, 6239]¶
-
STRING_BASE
= 'http://string-db.org/download/'¶
-
fetch
(is_dl_forced=False)¶ Override Source.fetch(). Fetches resources from STRING. We also fetch Ensembl to determine if protein pairs are from the same species. :param is_dl_forced: (bool) force download :return: None
-
parse
(limit=None)¶ Override Source.parse(). :param limit: (int, optional) limit the number of rows processed :return: None
-
-
class
dipper.sources.UCSCBands.
UCSCBands
(graph_type, are_bnodes_skolemized, tax_ids=None)¶ Bases:
dipper.sources.Source.Source
This will take the UCSC definitions of cytogenetic bands and create the nested structures to enable overlap and containment queries. We use `Monochrom.py` to create the OWL classes of the chromosomal parts. Here, we simply worry about the instance-level values for particular genome builds.
Given a chr band definition, the nested containment structures look like: 13q21.31 ==> 13q21.31, 13q21.3, 13q21, 13q2, 13q, 13
We determine the containing regions of the band by parsing the band-string; since each alphanumeric is a significant "place", we can split it with the shorter strings being parents of the longer string.
Here we create build-specific chroms, which are instances of the classes produced from `Monochrom.py`. You can instantiate any number of builds for a genome.
We leverage the Faldo model here for region definitions, and map each of the chromosomal parts to SO.
We differentiate the build by adding the build id to the identifier prior to the chromosome number. These then are instances of the species-specific chromosomal class.
The build-specific chromosomes are created like <build number>chr<num><band>, with triples for a given band like:
_:hg19chr1p36.33
  rdfs:type SO:chromosome_band, faldo:Region, CHR:9606chr1p36.33,
  subsequence_of _:hg19chr1p36.3,
  faldo:location [ a faldo:BothStrandPosition
    faldo:begin 0, faldo:end 2300000, faldo:reference 'hg19' ] .
where any band in the file is an instance of a chr_band (or a more specific type), is a subsequence of its containing region, and is located in the specified coordinates.
We do not have a separate graph for testing.
TODO: any species by commandline argument
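For orientation only, a small sketch using RDFGraph and Model that types one band instance and links it to its containing region. The CURIEs are illustrative (the real code uses blank nodes for build-specific bands), and the RO identifier used for the subsequence relation is an assumption rather than the exact term this source emits.
from dipper.graph.RDFGraph import RDFGraph
from dipper.models.Model import Model

graph = RDFGraph()
model = Model(graph)

band = 'CHR:hg19chr1p36.33'     # illustrative CURIE for a build-specific band
parent = 'CHR:hg19chr1p36.3'

model.addType(band, 'SO:0000341')             # chromosome_band
model.addLabel(band, 'chr1p36.33 (hg19)')
model.addTriple(band, 'RO:0002525', parent)   # 'is subsequence of' (assumed relation id)

print(graph.serialize(format='turtle').decode('utf-8'))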
-
HGGP
= 'http://hgdownload.cse.ucsc.edu/goldenPath'¶
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'10090': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/mm10/database/cytoBandIdeo.txt.gz', 'build_num': 'mm10', 'genome_label': 'Mouse', 'file': 'mm10cytoBand.txt.gz'}, '7955': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/danRer10/database/cytoBandIdeo.txt.gz', 'build_num': 'danRer10', 'genome_label': 'Zebrafish', 'file': 'danRer10cytoBand.txt.gz'}, '9031': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/galGal4/database/cytoBandIdeo.txt.gz', 'build_num': 'galGal4', 'genome_label': 'chicken', 'file': 'galGal4cytoBand.txt.gz'}, '9606': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz', 'build_num': 'hg19', 'genome_label': 'Human', 'file': 'hg19cytoBand.txt.gz'}, '9796': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/equCab2/database/cytoBandIdeo.txt.gz', 'build_num': 'equCab2', 'genome_label': 'horse', 'file': 'equCab2cytoBand.txt.gz'}, '9823': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/susScr3/database/cytoBandIdeo.txt.gz', 'build_num': 'susScr3', 'genome_label': 'pig', 'file': 'susScr3cytoBand.txt.gz'}, '9913': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/bosTau7/database/cytoBandIdeo.txt.gz', 'build_num': 'bosTau7', 'genome_label': 'cow', 'file': 'bosTau7cytoBand.txt.gz'}, '9940': {'url': 'http://hgdownload.cse.ucsc.edu/goldenPath/oviAri3/database/cytoBandIdeo.txt.gz', 'build_num': 'oviAri3', 'genome_label': 'sheep', 'file': 'oviAri3cytoBand.txt.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
-
class
dipper.sources.UDP.
UDP
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
The National Institutes of Health (NIH) Undiagnosed Diseases Program (UDP) is part of the Undiagnosed Disease Network (UDN), an NIH Common Fund initiative that focuses on the most puzzling medical cases referred to the NIH Clinical Center in Bethesda, Maryland. from https://www.genome.gov/27544402/the-undiagnosed-diseases-program/
Data is available by request for access via the NHGRI collaboration server: https://udplims-collab.nhgri.nih.gov/api
Note the fetcher requires credentials for the UDP collaboration server. Credentials are added via a config file, config.json, in the following format:
{
  "dbauth": {
    "udp": {
      "user": "foo",
      "password": "bar"
    }
  }
}
See dipper/config.py for more information
Output of fetcher:
- udp_variants.tsv: 'Patient', 'Family', 'Chr', 'Build', 'Chromosome Position', 'Reference Allele', 'Variant Allele', 'Parent of origin', 'Allele Type', 'Mutation Type', 'Gene', 'Transcript', 'Original Amino Acid', 'Variant Amino Acid', 'Amino Acid Change', 'Segregates with', 'Position', 'Exon', 'Inheritance model', 'Zygosity', 'dbSNP ID', '1K Frequency', 'Number of Alleles'
- udp_phenotypes.tsv: 'Patient', 'HPID', 'Present'
The script also utilizes two mapping files:
- udp_gene_map.tsv - generated from scripts/fetch-gene-ids.py, gene symbols from udp_variants
- udp_chr_rs.tsv - rsid(s) per coordinate grepped from the hg19 dbsnp file, then disambiguated with eutils, see scripts/dbsnp/dbsnp.py
-
UDP_SERVER
= 'https://udplims-collab.nhgri.nih.gov/api'¶
-
fetch
(is_dl_forced=True)¶ Fetches data from udp collaboration server, see top level comments for class for more information :return:
-
files
= {'patient_phenotypes': {'file': 'udp_phenotypes.tsv'}, 'patient_variants': {'file': 'udp_variants.tsv'}}¶
-
map_files
= {'dbsnp_map': '../../resources/udp/udp_chr_rs.tsv', 'gene_coord_map': '../../resources/udp/gene_coordinates.tsv', 'patient_ids': '../../resources/udp/patient_ids.yaml'}¶
-
parse
(limit=None)¶ Override Source.parse(). :param limit: (int, optional) limit the number of rows processed :return: None
-
class
dipper.sources.WormBase.
WormBase
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
This is the parser for the [C. elegans Model Organism Database (WormBase)](http://www.wormbase.org), from which we process genotype and phenotype data for laboratory worms (C.elegans and other nematodes).
We generate the wormbase graph to include the following information:
- genes
- sequence alterations (includes SNPs/del/ins/indel and large chromosomal rearrangements)
- RNAi as expression-affecting reagents
- genotypes, and their components
- strains
- publications (and their mapping to PMIDs, if available)
- allele-to-phenotype associations (including variants by RNAi)
- genetic positional information for genes and sequence alterations
Genotypes leverage the GENO genotype model and includes both intrinsic and extrinsic genotypes. Where necessary, we create anonymous nodes of the genotype partonomy (i.e. for variant single locus complements, genomic variation complements, variant loci, extrinsic genotypes, and extrinsic genotype parts).
TODO: get people and gene expression
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'allele_pheno': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY/phenotype_association.WSNUMBER.wb', 'file': 'phenotype_association.wb'}, 'checksums': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/CHECKSUMS', 'file': 'CHECKSUMS'}, 'disease_assoc': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY/disease_association.WSNUMBER.wb', 'file': 'disease_association.wb'}, 'feature_loc': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/c_elegans.PRJNA13758.WSNUMBER.annotations.gff3.gz', 'file': 'c_elegans.PRJNA13758.annotations.gff3.gz'}, 'gene_ids': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WSNUMBER.geneIDs.txt.gz', 'file': 'c_elegans.PRJNA13758.geneIDs.txt.gz'}, 'pub_xrefs': {'url': 'http://tazendra.caltech.edu/~azurebrd/cgi-bin/forms/generic.cgi?action=WpaXref', 'file': 'pub_xrefs.txt'}, 'rnai_pheno': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/ONTOLOGY/rnai_phenotypes.WSNUMBER.wb', 'file': 'rnai_phenotypes.wb'}, 'xrefs': {'url': 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WSNUMBER.xrefs.txt.gz', 'file': 'c_elegans.PRJNA13758.xrefs.txt.gz'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
get_feature_type_by_class_and_biotype
(ftype, biotype)¶
-
make_reagent_targeted_gene_id
(gene_id, reagent_id)¶
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_allele_phenotype
(limit=None)¶ This file compactly lists variant to phenotype associations, such that in a single row, there may be >1 variant listed per phenotype and paper. This indicates that each variant is individually associated with the given phenotype, as listed in 1+ papers. (Not that the combination of variants is producing the phenotype.) :param limit: :return:
-
process_disease_association
(limit)¶
-
process_feature_loc
(limit)¶
-
process_gene_desc
(limit)¶
-
process_gene_ids
(limit)¶
-
process_gene_interaction
(limit)¶ The gene interaction file includes identified interactions that are between two or more gene (products). In the case of interactions with >2 genes, this requires creating groups of genes that are involved in the interaction. From the wormbase help list: in the example WBInteraction000007779 it would likely be misleading to suggest that lin-12 interacts with (suppresses in this case) smo-1 ALONE, or that lin-12 suppresses let-60 ALONE; the observation in the paper (see Table V in PMID:15990876) was that a lin-12 allele (heterozygous lin-12(n941/+)) could suppress the "multivulva" phenotype induced synthetically by simultaneous perturbation of BOTH smo-1 (by RNAi) AND let-60 (by the n2021 allele). So this is necessarily a three-gene interaction.
Therefore, we can create groups of genes based on their “status” of Effector | Effected.
Status: IN PROGRESS
Parameters: limit – Returns:
-
process_pub_xrefs
(limit=None)¶
-
process_rnai_phenotypes
(limit=None)¶
-
species
= '/species/c_elegans/PRJNA13758'¶
-
test_ids
= {'allele': ['WBVar00087800', 'WBVar00087742', 'WBVar00144481', 'WBVar00248869', 'WBVar00250630'], 'gene': ['WBGene00001414', 'WBGene00004967', 'WBGene00003916', 'WBGene00004397', 'WBGene00001531'], 'pub': [], 'strain': ['BA794', 'RK1', 'HE1006']}¶
-
update_wsnum_in_files
(vernum)¶ With the given version number `vernum`, update the source's version number, and replace it in the file hashmap. The version number is in the CHECKSUMS file. :param vernum: :return:
-
wbdev
= 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-development-release'¶
-
wbprod
= 'ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release'¶
-
wbrel
= 'ftp://ftp.wormbase.org/pub/wormbase/releases'¶
-
class
dipper.sources.ZFIN.
ZFIN
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
This is the parser for the [Zebrafish Model Organism Database (ZFIN)](http://www.zfin.org), from which we process genotype and phenotype data for laboratory zebrafish.
We generate the zfin graph to include the following information:
- genes
- sequence alterations (includes SNPs/del/ins/indel and large chromosomal rearrangements)
- transgenic constructs
- morpholinos, talens, crisprs as expression-affecting reagents
- genotypes, and their components
- fish (as comprised of intrinsic and extrinsic genotypes)
- publications (and their mapping to PMIDs, if available)
- genotype-to-phenotype associations (including environments and stages at which they are assayed)
- environmental components
- orthology to human genes
- genetic positional information for genes and sequence alterations
- fish-to-disease model associations
Genotypes leverage the GENO genotype model and include both intrinsic and extrinsic genotypes. Where necessary, we create anonymous nodes of the genotype partonomy (such as for variant single locus complements, genomic variation complements, variant loci, extrinsic genotypes, and extrinsic genotype parts).
Furthermore, we process the genotype components to build labels in a monarch-style. This leads to genotype labels that include:
- all genes targeted by reagents (morphants, crisprs, etc), in addition to the ones that the reagent was designed against
- all affected genes within deficiencies
- complex hets being listed as gene<mutation1>/gene<mutation2> rather than gene<mutation1>/+; gene<mutation2>/+
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'backgrounds': {'url': 'http://zfin.org/downloads/genotype_backgrounds.txt', 'file': 'genotype_backgrounds.txt'}, 'crispr': {'url': 'http://zfin.org/downloads/CRISPR.txt', 'file': 'CRISPR.txt'}, 'enviro': {'url': 'http://zfin.org/downloads/pheno_environment_fish.txt', 'file': 'pheno_environment_fish.txt'}, 'feature_affected_gene': {'url': 'http://zfin.org/downloads/features-affected-genes.txt', 'file': 'features-affected-genes.txt'}, 'features': {'url': 'http://zfin.org/downloads/features.txt', 'file': 'features.txt'}, 'fish_components': {'url': 'http://zfin.org/downloads/fish_components_fish.txt', 'file': 'fish_components_fish.txt'}, 'fish_disease_models': {'url': 'http://zfin.org/downloads/fish_model_disease.txt', 'file': 'fish_model_disease.txt'}, 'genbank': {'url': 'http://zfin.org/downloads/genbank.txt', 'file': 'genbank.txt'}, 'gene': {'url': 'http://zfin.org/downloads/gene.txt', 'file': 'gene.txt'}, 'gene_coordinates': {'url': 'http://zfin.org/downloads/E_zfin_gene_alias.gff3', 'file': 'E_zfin_gene_alias.gff3'}, 'gene_marker_rel': {'url': 'http://zfin.org/downloads/gene_marker_relationship.txt', 'file': 'gene_marker_relationship.txt'}, 'geno': {'url': 'http://zfin.org/downloads/genotype_features.txt', 'file': 'genotype_features.txt'}, 'human_orthos': {'url': 'http://zfin.org/downloads/human_orthos.txt', 'file': 'human_orthos.txt'}, 'mappings': {'url': 'http://zfin.org/downloads/mappings.txt', 'file': 'mappings.txt'}, 'morph': {'url': 'http://zfin.org/downloads/Morpholinos.txt', 'file': 'Morpholinos.txt'}, 'pheno': {'url': 'http://zfin.org/downloads/phenotype_fish.txt', 'file': 'phenotype_fish.txt'}, 'pub2pubmed': {'url': 'http://zfin.org/downloads/pub_to_pubmed_id_translation.txt', 'file': 'pub_to_pubmed_id_translation.txt'}, 'pubs': {'url': 'http://zfin.org/downloads/zfinpubs.txt', 'file': 'zfinpubs.txt'}, 'stage': {'url': 'http://zfin.org/Downloads/stage_ontology.txt', 'file': 'stage_ontology.txt'}, 'talen': {'url': 'http://zfin.org/downloads/TALEN.txt', 'file': 'TALEN.txt'}, 'uniprot': {'url': 'http://zfin.org/downloads/uniprot.txt', 'file': 'uniprot.txt'}, 'wild': {'url': 'http://zfin.org/downloads/wildtypes_fish.txt', 'file': 'wildtypes.txt'}, 'zpmap': {'url': 'http://compbio.charite.de/hudson/job/zp-owl-new/lastSuccessfulBuild/artifact/zp.annot_sourceinfo', 'file': 'zp-mapping.txt'}}¶
-
getTestSuite
()¶ An abstract method that should be overwritten with tests appropriate for the specific source. :return:
-
get_orthology_evidence_code
(abbrev)¶
-
get_orthology_sources_from_zebrafishmine
()¶ Fetch the zfin gene to other species orthology annotations, together with the evidence for the assertion. Write the file locally to be read in a separate function. :return:
-
static
make_targeted_gene_id
(geneid, reagentid)¶
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
process_fish
(limit=None)¶ Fish give identifiers to the “effective genotypes” that we create. We can match these by: Fish = (intrinsic) genotype + set of morpholinos
We assume here that the intrinsic genotypes and their parts will be processed separately, prior to calling this function.
Parameters: limit – Returns:
-
process_fish_disease_models
(limit=None)¶
-
process_orthology_evidence
(limit)¶
-
scrub
()¶ Perform various data-scrubbing on the raw data files prior to parsing. For this resource, this currently includes: * remove oddities where there are “” instead of empty strings :return: None
-
test_ids
= {'allele': ['ZDB-ALT-010426-4', 'ZDB-ALT-010427-8', 'ZDB-ALT-011017-8', 'ZDB-ALT-051005-2', 'ZDB-ALT-051227-8', 'ZDB-ALT-060221-2', 'ZDB-ALT-070314-1', 'ZDB-ALT-070409-1', 'ZDB-ALT-070420-6', 'ZDB-ALT-080528-1', 'ZDB-ALT-080528-6', 'ZDB-ALT-080827-15', 'ZDB-ALT-080908-7', 'ZDB-ALT-090316-1', 'ZDB-ALT-100519-1', 'ZDB-ALT-111024-1', 'ZDB-ALT-980203-1374', 'ZDB-ALT-980203-412', 'ZDB-ALT-980203-465', 'ZDB-ALT-980203-470', 'ZDB-ALT-980203-605', 'ZDB-ALT-980413-636', 'ZDB-ALT-021021-2', 'ZDB-ALT-080728-1', 'ZDB-ALT-100729-1', 'ZDB-ALT-980203-1560', 'ZDB-ALT-001127-6', 'ZDB-ALT-001129-2', 'ZDB-ALT-980203-1091', 'ZDB-ALT-070118-2', 'ZDB-ALT-991005-33', 'ZDB-ALT-020918-2', 'ZDB-ALT-040913-6', 'ZDB-ALT-980203-1827', 'ZDB-ALT-090504-6', 'ZDB-ALT-121218-1'], 'environment': ['ZDB-EXP-050202-1', 'ZDB-EXP-071005-3', 'ZDB-EXP-071227-14', 'ZDB-EXP-080428-1', 'ZDB-EXP-080428-2', 'ZDB-EXP-080501-1', 'ZDB-EXP-080805-7', 'ZDB-EXP-080806-5', 'ZDB-EXP-080806-8', 'ZDB-EXP-080806-9', 'ZDB-EXP-081110-3', 'ZDB-EXP-090505-2', 'ZDB-EXP-100330-7', 'ZDB-EXP-100402-1', 'ZDB-EXP-100402-2', 'ZDB-EXP-100422-3', 'ZDB-EXP-100511-5', 'ZDB-EXP-101025-12', 'ZDB-EXP-101025-13', 'ZDB-EXP-110926-4', 'ZDB-EXP-110927-1', 'ZDB-EXP-120809-5', 'ZDB-EXP-120809-7', 'ZDB-EXP-120809-9', 'ZDB-EXP-120913-5', 'ZDB-EXP-130222-13', 'ZDB-EXP-130222-7', 'ZDB-EXP-130904-2', 'ZDB-EXP-041102-1', 'ZDB-EXP-140822-13', 'ZDB-EXP-041102-1', 'ZDB-EXP-070129-3', 'ZDB-EXP-110929-7', 'ZDB-EXP-100520-2', 'ZDB-EXP-100920-3', 'ZDB-EXP-100920-5', 'ZDB-EXP-090601-2', 'ZDB-EXP-151116-3'], 'fish': ['ZDB-FISH-150901-17912', 'ZDB-FISH-150901-18649', 'ZDB-FISH-150901-26314', 'ZDB-FISH-150901-9418', 'ZDB-FISH-150901-14591', 'ZDB-FISH-150901-9997', 'ZDB-FISH-150901-23877', 'ZDB-FISH-150901-22128', 'ZDB-FISH-150901-14869', 'ZDB-FISH-150901-6695', 'ZDB-FISH-150901-24158', 'ZDB-FISH-150901-3631', 'ZDB-FISH-150901-20836', 'ZDB-FISH-150901-1060', 'ZDB-FISH-150901-8451', 'ZDB-FISH-150901-2423', 'ZDB-FISH-150901-20257', 'ZDB-FISH-150901-10002', 'ZDB-FISH-150901-12520', 'ZDB-FISH-150901-14833', 'ZDB-FISH-150901-2104', 'ZDB-FISH-150901-6607', 'ZDB-FISH-150901-1409'], 'gene': ['ZDB-GENE-000616-6', 'ZDB-GENE-000710-4', 'ZDB-GENE-030131-2773', 'ZDB-GENE-030131-8769', 'ZDB-GENE-030219-146', 'ZDB-GENE-030404-2', 'ZDB-GENE-030826-1', 'ZDB-GENE-030826-2', 'ZDB-GENE-040123-1', 'ZDB-GENE-040426-1309', 'ZDB-GENE-050522-534', 'ZDB-GENE-060503-719', 'ZDB-GENE-080405-1', 'ZDB-GENE-081211-2', 'ZDB-GENE-091118-129', 'ZDB-GENE-980526-135', 'ZDB-GENE-980526-166', 'ZDB-GENE-980526-196', 'ZDB-GENE-980526-265', 'ZDB-GENE-980526-299', 'ZDB-GENE-980526-41', 'ZDB-GENE-980526-437', 'ZDB-GENE-980526-44', 'ZDB-GENE-980526-481', 'ZDB-GENE-980526-561', 'ZDB-GENE-980526-89', 'ZDB-GENE-990415-181', 'ZDB-GENE-990415-72', 'ZDB-GENE-990415-75', 'ZDB-GENE-980526-44', 'ZDB-GENE-030421-3', 'ZDB-GENE-980526-196', 'ZDB-GENE-050320-62', 'ZDB-GENE-061013-403', 'ZDB-GENE-041114-104', 'ZDB-GENE-030131-9700', 'ZDB-GENE-031114-1', 'ZDB-GENE-990415-72', 'ZDB-GENE-030131-2211', 'ZDB-GENE-030131-3063', 'ZDB-GENE-030131-9460', 'ZDB-GENE-980526-26', 'ZDB-GENE-980526-27', 'ZDB-GENE-980526-29', 'ZDB-GENE-071218-6', 'ZDB-GENE-070912-423', 'ZDB-GENE-011207-1', 'ZDB-GENE-980526-284', 'ZDB-GENE-980526-72', 'ZDB-GENE-991129-7', 'ZDB-GENE-000607-83', 'ZDB-GENE-090504-2'], 'genotype': ['ZDB-GENO-010426-2', 'ZDB-GENO-010427-3', 'ZDB-GENO-010427-4', 'ZDB-GENO-050209-30', 'ZDB-GENO-051018-1', 'ZDB-GENO-070209-80', 'ZDB-GENO-070215-11', 'ZDB-GENO-070215-12', 'ZDB-GENO-070228-3', 'ZDB-GENO-070406-1', 'ZDB-GENO-070712-5', 
'ZDB-GENO-070917-2', 'ZDB-GENO-080328-1', 'ZDB-GENO-080418-2', 'ZDB-GENO-080516-8', 'ZDB-GENO-080606-609', 'ZDB-GENO-080701-2', 'ZDB-GENO-080713-1', 'ZDB-GENO-080729-2', 'ZDB-GENO-080804-4', 'ZDB-GENO-080825-3', 'ZDB-GENO-091027-1', 'ZDB-GENO-091027-2', 'ZDB-GENO-091109-1', 'ZDB-GENO-100325-3', 'ZDB-GENO-100325-4', 'ZDB-GENO-100325-5', 'ZDB-GENO-100325-6', 'ZDB-GENO-100524-2', 'ZDB-GENO-100601-2', 'ZDB-GENO-100910-1', 'ZDB-GENO-111025-3', 'ZDB-GENO-120522-18', 'ZDB-GENO-121210-1', 'ZDB-GENO-130402-5', 'ZDB-GENO-980410-268', 'ZDB-GENO-080307-1', 'ZDB-GENO-960809-7', 'ZDB-GENO-990623-3', 'ZDB-GENO-130603-1', 'ZDB-GENO-001127-3', 'ZDB-GENO-001129-1', 'ZDB-GENO-090203-8', 'ZDB-GENO-070209-1', 'ZDB-GENO-070118-1', 'ZDB-GENO-140529-1', 'ZDB-GENO-070820-1', 'ZDB-GENO-071127-3', 'ZDB-GENO-000209-20', 'ZDB-GENO-980202-1565', 'ZDB-GENO-010924-10', 'ZDB-GENO-010531-2', 'ZDB-GENO-090504-5', 'ZDB-GENO-070215-11', 'ZDB-GENO-121221-1'], 'morpholino': ['ZDB-MRPHLNO-041129-1', 'ZDB-MRPHLNO-041129-2', 'ZDB-MRPHLNO-041129-3', 'ZDB-MRPHLNO-050308-1', 'ZDB-MRPHLNO-050308-3', 'ZDB-MRPHLNO-060508-2', 'ZDB-MRPHLNO-070118-1', 'ZDB-MRPHLNO-070522-3', 'ZDB-MRPHLNO-070706-1', 'ZDB-MRPHLNO-070725-1', 'ZDB-MRPHLNO-070725-2', 'ZDB-MRPHLNO-071005-1', 'ZDB-MRPHLNO-071227-1', 'ZDB-MRPHLNO-080307-1', 'ZDB-MRPHLNO-080428-1', 'ZDB-MRPHLNO-080430-1', 'ZDB-MRPHLNO-080919-4', 'ZDB-MRPHLNO-081110-3', 'ZDB-MRPHLNO-090106-5', 'ZDB-MRPHLNO-090114-1', 'ZDB-MRPHLNO-090505-1', 'ZDB-MRPHLNO-090630-11', 'ZDB-MRPHLNO-090804-1', 'ZDB-MRPHLNO-100728-1', 'ZDB-MRPHLNO-100823-6', 'ZDB-MRPHLNO-101105-3', 'ZDB-MRPHLNO-110323-3', 'ZDB-MRPHLNO-111104-5', 'ZDB-MRPHLNO-130222-4', 'ZDB-MRPHLNO-080430', 'ZDB-MRPHLNO-100823-6', 'ZDB-MRPHLNO-140822-1', 'ZDB-MRPHLNO-100520-4', 'ZDB-MRPHLNO-100520-5', 'ZDB-MRPHLNO-100920-3', 'ZDB-MRPHLNO-050604-1', 'ZDB-CRISPR-131113-1', 'ZDB-MRPHLNO-140430-12', 'ZDB-MRPHLNO-140430-13'], 'pub': ['PMID:11566854', 'PMID:12588855', 'PMID:12867027', 'PMID:14667409', 'PMID:15456722', 'PMID:16914492', 'PMID:17374715', 'PMID:17545503', 'PMID:17618647', 'PMID:17785424', 'PMID:18201692', 'PMID:18358464', 'PMID:18388326', 'PMID:18638469', 'PMID:18846223', 'PMID:19151781', 'PMID:19759004', 'PMID:19855021', 'PMID:20040115', 'PMID:20138861', 'PMID:20306498', 'PMID:20442775', 'PMID:20603019', 'PMID:21147088', 'PMID:21893049', 'PMID:21925157', 'PMID:22718903', 'PMID:22814753', 'PMID:22960038', 'PMID:22996643', 'PMID:23086717', 'PMID:23203810', 'PMID:23760954', 'ZFIN:ZDB-PUB-140303-33', 'ZFIN:ZDB-PUB-140404-9', 'ZFIN:ZDB-PUB-080902-16', 'ZFIN:ZDB-PUB-101222-7', 'ZFIN:ZDB-PUB-140614-2', 'ZFIN:ZDB-PUB-120927-26', 'ZFIN:ZDB-PUB-100504-5', 'ZFIN:ZDB-PUB-140513-341']}¶
-
-
class
dipper.sources.ZFINSlim.
ZFINSlim
(graph_type, are_bnodes_skolemized)¶ Bases:
dipper.sources.Source.Source
ZFIN mgi model containing only gene-to-phenotype associations, using the file here: https://zfin.org/downloads/phenoGeneCleanData_fish.txt
-
fetch
(is_dl_forced=False)¶ abstract method to fetch all data from an external resource. this should be overridden by subclasses :return: None
-
files
= {'g2p_clean': {'url': 'https://zfin.org/downloads/phenoGeneCleanData_fish.txt', 'file': 'phenoGeneCleanData_fish.txt.txt'}, 'zpmap': {'url': 'http://compbio.charite.de/hudson/job/zp-owl-new/lastSuccessfulBuild/artifact/zp.annot_sourceinfo', 'file': 'zp-mapping.txt'}}¶
-
parse
(limit=None)¶ abstract method to parse all data from an external resource, that was fetched in fetch() this should be overridden by subclasses :return: None
-
dipper.utils package¶
-
class
dipper.utils.DipperUtil.
DipperUtil
¶ Bases:
object
Various utilities and quick methods used in this application
(A little too quick.) Per https://www.ncbi.nlm.nih.gov/books/NBK25497/, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays. Restructuring to make bulk queries is less likely to result in another ban for peppering them with one-offs.
-
static
get_homologene_by_gene_num
(gene_num)¶
-
static
get_ncbi_id_from_symbol
(gene_symbol)¶ Get ncbi gene id from symbol using monarch and mygene services :param gene_symbol: :return:
-
static
get_ncbi_taxon_num_by_label
(label)¶ Here we want to look up the NCBI Taxon id using some kind of label. It will only return a result if there is a unique hit.
Returns:
-
static
is_omim_disease
(gene_id)¶ Process omim equivalencies by examining the monarch ontology scigraph As an alternative we could examine mondo.owl, since the ontology scigraph imports the output of this script which creates an odd circular dependency (even though we’re querying mondo.owl through scigraph)
Parameters: - graph – rdfLib graph object
- gene_id – ncbi gene id as curie
- omim_id – omim id as curie
Returns: None
-
remove_control_characters
(s)¶ Filters out characters in any of these unicode categories:
- [Cc] Other, Control (65 characters)
- [Cf] Other, Format (151 characters)
- [Cn] Other, Not Assigned (0 characters – none have this property)
- [Co] Other, Private Use (6 characters)
- [Cs] Other, Surrogate (6 characters)
-
-
class
dipper.utils.GraphUtils.
GraphUtils
(curie_map)¶ Bases:
object
-
static
add_property_axioms
(graph, properties)¶
-
static
add_property_to_graph
(results, graph, property_type, property_list)¶
-
digest_id
()¶ Form a deterministic digest of the input. The leading 'b' is an experiment forcing the first char to be non-numeric but valid hex; it is not required for RDF, but some other contexts do not want the leading char to be a digit.
:param str wordage: arbitrary string
:return: str
-
static
get_properties_from_graph
(graph)¶ Wrapper for RDFLib.graph.predicates() that returns a unique set :param graph: RDFLib.graph :return: set, set of properties
-
write
(graph, fileformat=None, file=None)¶ A basic graph writer (to stdout) for any of the sources. This will write raw triples in rdfxml, unless specified otherwise. To write turtle, specify format='turtle'. An optional file can be supplied instead of stdout. :return: None
-
-
dipper.utils.pysed.
replace
(oldstr, newstr, infile, dryrun=False)¶ Sed-like replace function. Usage: pysed.replace(<Old string>, <Replacement String>, <Text File>) Example: pysed.replace('xyz', 'XYZ', '/path/to/file.txt')
Example 'DRYRUN': pysed.replace('xyz', 'XYZ', '/path/to/file.txt', dryrun=True) (the dry run dumps the output to STDOUT instead of changing the input file).
-
dipper.utils.pysed.
rmlinematch
(oldstr, infile, dryrun=False)¶ Sed-like line deletion function based on a given string. Usage: pysed.rmlinematch(<Unwanted string>, <Text File>) Example: pysed.rmlinematch('xyz', '/path/to/file.txt') Example 'DRYRUN': pysed.rmlinematch('xyz', '/path/to/file.txt', dryrun=True) (the dry run dumps the output to STDOUT instead of changing the input file).
-
dipper.utils.pysed.
rmlinenumber
(linenumber, infile, dryrun=False)¶ Sed-like line deletion function based on a given line number. Usage: pysed.rmlinenumber(<Unwanted Line Number>, <Text File>) Example: pysed.rmlinenumber(10, '/path/to/file.txt') Example 'DRYRUN': pysed.rmlinenumber(10, '/path/to/file.txt', dryrun=True) (the dry run dumps the output to STDOUT instead of changing the input file).
-
exception
dipper.utils.romanplus.
InvalidRomanNumeralError
¶
-
exception
dipper.utils.romanplus.
NotIntegerError
¶
-
exception
dipper.utils.romanplus.
OutOfRangeError
¶
-
exception
dipper.utils.romanplus.
RomanError
¶ Bases:
Exception
-
dipper.utils.romanplus.
fromRoman
(s)¶ convert Roman numeral to integer
-
dipper.utils.romanplus.
toRoman
(n)¶ convert integer to Roman numeral
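For example (assuming the usual behaviour of these helpers; malformed or out-of-range input is expected to raise the exceptions listed above):
from dipper.utils.romanplus import toRoman, fromRoman

print(toRoman(2017))      # 'MMXVII'
print(fromRoman('XIV'))   # 14
# fromRoman('ABC') is expected to raise InvalidRomanNumeralError,
# and toRoman(0.5) is expected to raise NotIntegerError.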