
BioFlow Project

Information Flow Analysis in biological networks


Overview:

BioFlow is a quantitative systems biology tool.

It leverages information complexity theory to generate mechanism-of-action hypotheses in a robust, quantified, highly granular, explainable, customizable and near-optimal manner.

The core of BioFlow is a biological knowledge entity relationship graph. It is derived from existing biological knowledge repositories, such as Reactome, Gene Ontology, HINT, PhosphoSite, ComplexPortal or BioGRID. Upon importing them, BioFlow converts the entities and relationships they contain into a unified internal representation.

The biological knowledge entities from different repositories are cross-referenced, and connections between them are established and weighted based on how likely they are to be directly connected in a hypothesis for a general mechanism of action.

The input for BioFlow is a list of physical entities most likely involved in a process of interest - usually a list of genes or proteins.

BioFlow then examines all the possible paths linking those physical entities of interest, assigns an overall probability to each path, and analyses the most likely common denominators among the paths, assigning to each biological knowledge entity and each relationship between them a chance of being part of the mechanism of action behind the process.

To avoid picking up spurious associations, the resulting weights for the biological knowledge entities and the relationships between them are compared to the ones generated by random groups of genes, and only based on that comparison are the final probabilities of inclusion into mechanism-of-action hypotheses calculated.

The knowledge entities are sorted by their probability of inclusion, printed to the terminal and saved as a .tsv file, whereas the entire network, with probabilities assigned to entities and the relationships between them, is saved for further analysis as a .gdf file.

BioFlow is:

  • robust, because we weight the relationships between knowledge entities based on how likely they are to be real and to be included in a mechanism of action. Thanks to that, we can include relationships we are not sure of (e.g. Y2H protein-protein interactions with a low confidence score) and, by assigning them a low weight, make sure they will not be included in mechanism-of-action hypotheses unless no other possible connection can explain a phenotype.
  • quantified, because every single biological knowledge entity is assigned both a weight and a p-value for how likely it is to contribute to a mechanism of action.
  • highly granular, because thanks to the weighting of edges and the inclusion of multiple sources of biological knowledge, we can evaluate in the same pass steps for which we have a very good mechanistic understanding (e.g. site-specific phosphorylation of a specific isoform by another specific isoform) as well as steps whose in-vivo existence we are not entirely convinced of (once again, low-score Y2H). Similarly, a unified model of biological entities and the relationships between them lets knowledge from different repositories be evaluated together.
  • explainable, because all the steps in the hypothesis are biological entities and connections between them. As such, a direct translation from a BioFlow analysis to an experiment is “let’s suppress this entity/relationship with a high p-value and a high weight” and see if it affects our process.
  • customizable, because BioFlow is built as a library with multiple abstraction levels and customization capabilities.
    • The knowledge graph can be directly accessed and modified through the graphical user interface provided by the neo4j graph database storage back-end, as well as extracted as scipy.sparse weighted graph laplacian and adjacency matrices with index-to-entity-id maps (a short access sketch follows this list).
    • By adding new rdf turtle tuple parsers into the bioflow.bio_db_parser and inserters into the bioflow.db_importers, new sources of biological knowledge can be integrated.
    • By modifying routines in the bioflow.algorithms_bank, new entity relationship weighting modes, background sampling algorithms or methods for evaluating the statistical significance of hypotheses can be introduced.
    • In case of absolute need, alternative storage backends can be implemented by re-implementing the GraphDBPipe object in bioflow.neo4j_db.cypher_drivers or methods from bioflow.sample_storage.mongodb.
  • near-optimal, because we are using a finite-world version of Solomonoff Algorithmic Inference - a provably optimal way of learning model representation from data. In order to accelerate the computation and stabilize the output with respect to the noise often encountered in biological data, we slightly modify the inference mode and background computation algorithm, hence the “near”-optimality.
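
As an illustration of the matrix-extraction point above, here is a minimal sketch of inspecting the extracted matrices. It assumes an already-built InteractomeInterface instance (see the usage walk-through below); the attribute names (laplacian_matrix, adjacency_matrix, matrix_index_2_neo4j_id) follow the "Main knowledge graph parsing" section of the usage guides.

> def inspect_interactome_matrices(interactome):
>     # `interactome` is assumed to be an already-built InteractomeInterface instance
>     lap = interactome.laplacian_matrix      # scipy.sparse weighted graph laplacian
>     adj = interactome.adjacency_matrix      # scipy.sparse weighted adjacency matrix
>     print("nodes in the giant component:", lap.shape[0])
>     print("non-zero adjacency entries:", adj.nnz)
>     # map a matrix row index back to the internal neo4j id of the knowledge entity
>     print("entity at matrix index 0:", interactome.matrix_index_2_neo4j_id[0])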

Examples of applications:

A good way of thinking about BioFlow is as a specialized search engine for biological knowledge.

Molecular mechanism hypotheses from high-throughput experiment results:

A lot of modern biology explores phenotypes of interest from an unbiased perspective. To get an initial idea of a mechanism, a standard approach is to derive several experimental models that present a deviation of the phenotype of interest and perform a high-throughput experiment, such as mutagenesis, a mutant library screen, transcription profiling or proteomics profiling. The result is often a list of hundreds of genes that are potentially involved in a mechanism, but it is rare that a mechanism is directly visible, or even that there is any certainty that the list is not mostly composed of experimental artefacts.

BioFlow is capable of generating an unbiased, quantified list of hypotheses as to the molecular mechanism underlying the process.

  • Thanks to its integrated knowledge graph, it can implicate mechanisms that would not be detectable by the screening method used to generate the input data.
  • Thanks to its parallel evaluation of possible molecular mechanisms, it can point to backup mechanisms as well as molecular entities or pathways weakly involved in the process of interest.
  • Thanks to its null model of a random list of genes, it can filter out random nodes that are due to artefacts.
  • Thanks to its null model, if the provided list of genes is a pure artefact, it will not call any nodes as likely and will mark all hypotheses of molecular mechanisms as insufficiently likely.

Personalized cancer medicine:

Cancers are characterized by a large variety of mutations affecting numerous pathways in the organism. Some of the effects are antagonistic, some synergistic, but a lot of the mutations/expression modifications are variations affecting similar pathways.

BioFlow can integrate the list of perturbations found in the patient's cancer (such as mutations, transcriptional modifications, protein trafficking imbalances, ...) and build a model of the perturbed molecular pathways in the patient, allowing drug and drug-combination selection to be prioritized.

Evaluation of the effects of large-scale genome perturbation:

In some cases, such as partial genome duplication or aneuploidy, or a recombination event, a large number of gene expression levels or protein structures are perturbed simultaneously. While models might exist for single genes or small groups of genes, the sheer number of perturbations and the ways in which they can interfere make the prediction intractable for humans.

If any group of perturbations is likely to have a major synergistic effect, BioFlow will highlight it, as well as the likely molecular mechanism it could act through.

In-silico drug secondary effects prediction:

Given that a lot of small-molecule drugs possess polypharmacological, multi-target activity, a number of secondary effects that cannot be traced to a single target being poisoned by the compound or its metabolic derivatives are hypothesized to be related to the systemic effects of unspecific binding of the compound.

By combining the list of compounds that have been associated with a specific secondary effect with the binding profiles for those compounds (either measured in-vitro or simulated in-silico), BioFlow can build the network of nodes and relationships most likely implicated in the mechanism of action underlying the induction of the secondary effect by off-target binding. By comparing the network flow generated by a de-novo compound's binding profile to the ones generated by drugs that do or do not present the secondary effect, we can infer the most likely secondary effects and prioritize downstream pre-clinical testing.

In-silico drug repurposing:

One of the advantages of already-approved drugs is that they have been shown to be relatively safe in humans and have well-understood secondary effects. As such, their application to the treatment of novel diseases is significantly more desirable than the development of new compounds, both due to a shorter development time and because only efficacy trials in humans are needed. This application is particularly interesting for rare and neglected diseases, where de-novo compound development is usually not economically viable.

BioFlow can be used to construct the profile of biological entities that are most likely to be implicated in the molecular mechanism of action behind the disease (such as deviations from the norm in human rare genetic disorders, or essential pathways in pathogens). Based on in-silico or in-vivo binding assays of approved compounds against the targets relevant to the phenotype, BioFlow can help prioritize the compounds for further investigation.

Relationship to other methods:

Network diffusive models:

BioFlow generalizes network diffusion models. While both BioFlow and network diffusion models rely on the graph laplacian and the flow through it, BioFlow uses that formalization to rank the most likely molecular mechanisms in a maximally unbiased, nearly optimal manner, whereas most network diffusion models work by pure analogy.

BioFlow's near-optimality provides an explanation for the uncanny efficiency of graph diffusion models, but in addition it provides direct interpretability of the results and suggests schemes for weighting the graph edges.

Network topology methods:

Compared to network topology methods, thanks to its weighting scheme and all-paths probability evaluation, BioFlow is much less brittle with respect to the inclusion of low-confidence edges affecting the biological topology. It allows multiple abstraction levels to be examined simultaneously, which is particularly interesting in cases where granular information is available only for a small subset of nodes and edges. In turn, this capability to work with multiple levels of abstraction granularity allows BioFlow to work with heterogeneous data, integrating different types of perturbations at the same time.

Annotation group methods:

Given that BioFlow does not rely on strict borders between categories, uses weights and simultaneously evaluates all the possible molecular mechanisms based on the data, it is significantly less brittle with regard to the inclusion of specific molecular entities in the annotation groups or in the list associated with the process of interest.

Similarly, when BioFlow analyses the possible hypotheses for human-generated annotation networks, it provides much more interpretability with regard to the annotation proximity of terms, taking into account annotation term overloading, single-molecular-entity over-annotation, and the confidence of a molecular entity's annotation with an annotation term.

Finally, by combining the analysis of multiple annotation networks with the molecular entity network, it is less prone to neglecting processes that have not yet been annotated in the annotation network.

Mechanistic models:

Given that BioFlow can operate with multiple levels of granularity of biological knowledge abstractions and uses a unified model for all the molecular entities and the relationships between them, it can work with many more data types, does not require exact knowledge to generate hypotheses, and is computationally simpler and more stable at scale. Similarly, it is able to suggest possible mechanisms where no mechanistic models yet exist.

However, BioFlow's model means that it has more restricted expressivity. For instance, it will not be able to recognize synergistic vs antagonistic interactions between the perturbations it is analyzing, or distinguish repressors from inducers.

Overall, BioFlow is a good precursor to mechanistic models if the nodes and interactions that are ranked highly by BioFlow have a strong overlap with known mechanistic models.

Functioning and specifics of the implementation:

BioFlow requires a running Neo4j graph database instance for the main knowledge repository, as well as a running MongoDB instance.

Upon start, BioFlow will look for the $BIOFLOWHOME environment variable to know where to store its files. If it is not found, it will use the default ~/bioflow directory.

Inside $BIOFLOWHOME it stores the user config .yaml file ($BIOFLOWHOME/configs/main_configs.yaml). If for whatever reason it doesn't find it, it will copy its default configs there. If you want to reset the configs to their defaults, just delete or rename your config .yaml file.

The config contains several sections:

  • DB_locations: maps where to look for the databases used to build the main biological entity relationship graph and where to store them locally. If you get an error on download, chances are one of the source databases has moved. Alternatively, if you want to use a specific snapshot of a database, you can change the online location the file is loaded from.
  • Servers: stores the urls and ports at which BioFlow will expect MongoDB and Neo4j to be available.
  • Sources: allows selecting the organism. If you are not sure of what you are doing, just uncomment the organism you want to work on.
  • User_settings:
    • smtp_logging: enable and configure this if you want to receive notifications about errors or run completion by mail. Given that you will need a local smtp server sending mail properly, setting up this section is not for the faint of heart.
    • environement: modifies how some aspects of BioFlow work. Comments explain what each setting does, but you will need to understand the inner workings of BioFlow to know how it works.
    • analysis: controls the parameters used to calculate statistical significance.
    • debug_flags: potentially useful if you want to debug an issue or fill out a bug report.
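
For orientation, here is a minimal sketch of locating and inspecting that file, assuming PyYAML is available and the default $BIOFLOWHOME layout described above (the section names are taken from the list above; check the file itself for the exact keys):

> import os
> import yaml  # PyYAML, assumed to be installed
> config_path = os.path.join(
>     os.environ.get("BIOFLOWHOME", os.path.expanduser("~/bioflow")),
>     "configs", "main_configs.yaml")
> with open(config_path) as config_file:
>     configs = yaml.safe_load(config_file)
> # top-level sections described in the list above
> for section in ("DB_locations", "Servers", "Sources", "User_settings"):
>     print(section, "->", list(configs.get(section, {})))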

Everything is logged to $BIOFLOWHOME/.internal/logs. As such, debugs, critical errors and warnings are all stored there.

Upon execution, a run output folder is created in $BIOFLOWHOME/outputs/ with an ISO datetime name; it will contain any output generated by the run, as well as an info-level log (basically, a copy of what is printed to the console).

Finally, due to large differences in topological structure and weighting algorithms, the analysis of biological knowledge nodes that represent molecular entities (proteins, isoforms, small molecules) and of those that represent human-made abstractions used to reason about them (Gene Ontology terms, pathways, ...) is split into two different modules (molecular network/Interactome vs annotation network/BioKnowledge modules/classes).

The full API documentation is available at readthedocs.org.

Basic Usage:

Installation walk-through:

Ubuntu direct installation:

  1. Install Anaconda Python 3.x and use the Python provided by Anaconda in all that follows. A way of doing this is by making Anaconda Python your default Python. The full process is explained here.

  2. Install libsuitesparse:

    > apt-get -y install libsuitesparse-dev
    
  3. Install neo4j:

    > wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
    > echo 'deb https://debian.neo4j.org/repo stable/' | sudo tee /etc/apt/sources.list.d/neo4j.list
    > sudo apt-get update
    > sudo apt-get install neo4j
    
  4. Install MongoDB (assuming Ubuntu 18.04 - if not, see here):

    > sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 9DA31620334BD75D9DCB49F368818C72E52529D4
    > echo "deb [ arch=amd64 ] https://repo.mongodb.org/apt/ubuntu bionic/mongodb-org/4.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-4.0.list
    > sudo apt-get update
    > sudo apt-get install -y mongodb-org
    

For more information, refer to the installation guide

  5. Finally, install BioFlow through pip:

    > pip install BioFlow
    

Or, if you want to install it directly:

> git clone https://github.com/chiffa/BioFlow.git
> cd <installation directory/BioFlow>
> pip install -r requirements.txt

Docker:

If you want to build locally (note that you need to issue docker commands as the actual docker-enabled user, usually by prepending sudo to the commands):

> git clone https://github.com/chiffa/BioFlow.git
> cd <BioFlow installation folder>
> docker build -t "bioflow" .
> docker run bioflow
> docker-compose build
> docker-compose up -d

If you want to pull from dockerhub or don’t have access to BioFlow installation directory:

> wget https://github.com/chiffa/BioFlow/blob/master/docker-compose.yml
> mkdir -p $BIOFLOWHOME/input
> mkdir -p $BIOFLOWHOME/source
> mkdir -p $BIOFLOWHOME/.internal/docker-mongo/db-data
> mkdir -p $BIOFLOWHOME/.internal/docker-neo4j/db-data
> docker-compose build
> docker-compose up -d

Finally attach to the running container:

> docker attach bioflow_bioflow_1

For working from docker, you will need to have the $BIOFLOWHOME environment variable defined (by default $HOME/bioflow).

Scripts with which docker build was tested can be found in the docker_script.sh file.

For persistent storage, the data will be stored in the mapped volumes as follows:

Volume mapping:

  What            Docker                      On disk
  neo4j data      /data (neo4j docker)        $BIOFLOWHOME/.internal/docker-neo4j/db-data
  mongodb data    /data/db (mongodb docker)   $BIOFLOWHOME/.internal/docker-mongo/db-data
  bioflow home    /bioflow                    $BIOFLOWHOME
  inputs          /input                      $BIOFLOWHOME/input

Usage walk-through:

Warning

While BioFlow provides an interface to download the databases programmatically, the databases are subject to licenses and terms that it is up to the end users to respect.

For more information about data and config files, refer to the data and database guide

Python scripts:

This is the recommended method for using BioFlow.

An example usage script is provided by bioflow.analysis_pipeline_example.py.

First, let’s pull the online databases:

> from bioflow.utils.source_dbs_download import pull_online_dbs
> pull_online_dbs()

Now, we can import the main knowledge repository database handlers and build the main database:

> from bioflow.db_importers.import_main import destroy_db, build_db
> build_db()

The building process will take a while - up to a couple of hours.

Now, you can start using BioFlow proper:

> from bioflow.annotation_network.knowledge_access_analysis import auto_analyze as \
>    knowledge_analysis, _filter
> from bioflow.molecular_network.interactome_analysis import auto_analyze as interactome_analysis
> # map_and_save_gene_ids and rebuild_the_laplacians are used below; they are assumed
> # to be importable from bioflow.utils.top_level (cf. bioflow.analysis_pipeline_example)
> from bioflow.utils.top_level import map_and_save_gene_ids, rebuild_the_laplacians

> hits_file = "/your/path/here.tsv"
> background_file = "/your/other_path/here.tsv"

And get to work: map the hits and the background genes to internal db IDs:

> hit_ids, background_ids = map_and_save_gene_ids(hits_file, background_file)

BioFlow expects the tsv/csv hits or background files to contain one identifier per line, and will attempt to map them to UNIPROT protein nodes (used as a backbone to cross-link imported databases), based on the following identifier types:

  • Gene names
  • HGNC symbols
  • PDB Ids
  • ENSEMBL Ids
  • RefSeq IDs
  • Uniprot IDs
  • Uniprot accession numbers
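
As an illustration of the expected format, a hits file is simply a text file with one identifier per line; the gene symbols below are purely illustrative:

> # purely illustrative: write a hits file with one gene symbol per line,
> # to be passed to map_and_save_gene_ids as the hits file path
> with open("hits.tsv", "w") as hits_tsv:
>     hits_tsv.write("\n".join(["TP53", "MDM2", "ATM"]))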

(Re)build the laplacians (not required unless the knowledge structure in the main knowledge database has changed):

> rebuild_the_laplacians()

Launch the analysis itself for the information flow in the interactome:

> interactome_analysis(source_list=[hit_ids],
                       output_destinations=['<name_of_experiment>'],     #optional
                       desired_depth=20,                                 #optional
                       processors=3,                                     #optional
                       background_list=background_ids,                   #optional
                       skip_sampling=False)                              #optional

Launch the analysis itself for the information flow in the annotation network (experimental):

> knowledge_analysis(source_list=[hit_ids],
                     output_destinations=['<name_of_experiment>'],     #optional
                     desired_depth=20,                                 #optional
                     processors=3,                                     #optional
                     background_list=background_ids,                   #optional
                     skip_sampling=False)                              #optional

Where:

hit_ids:              list of hits
output_destinations:  names to provide for the output destinations (by default numbered from 0)
desired_depth:        how many samples we would like to generate to compare against
processors:           how many threads we would like to launch in parallel (in general 3/4 works best)
background_list:      list of background ids
skip_sampling:        if True, skips the sampling of the background set and retrieves stored samples instead

BioFlow will print progress to StdErr from then on and will write its output to a run folder under $BIOFLOWHOME/outputs/, named with the launch datetime:

  • .gdf file with the flow network and relevance statistics (Interactome_Analysis_output.gdf)
  • visualisation of information flow through nodes in the null vs hits sets based on the node degree
  • list of strongest hits (interactome_stats.tsv) (printed to StdOut as well)

The .gdf file can be further analysed with more appropriate tools, such as Gephi.

Enabling SMTP logging requires you to manually build a try-except around your script code:

> from bioflow.utils.smtp_log_behavior import get_smtp_logger, started_process, \
>     successfully_completed, smtp_error_bail_out

> # notify by mail that the run has started; bail out of smtp logging on mail failure
> try:
>     started_process()
> except Exception as e:
>     smtp_error_bail_out()
>     raise e

> # `logger` below is assumed to be an smtp-enabled logger obtained through get_smtp_logger
> try:
>     <your code here>
> except Exception as e:
>     try:
>         logger.exception(e)
>     except Exception as e:
>         smtp_error_bail_out()
>         raise e
>     raise e
> else:
>     try:
>         successfully_completed()
>     except Exception as e:
>         smtp_error_bail_out()
>         raise e

Command line:

The command line interface can either be invoked through a Python execution:

> python -m bioflow.cli <command> [--options]

Or, in case of installation with pip, directly from the command line (assumed here):

> bioflow <command> [--options]

Set up the environment (likely to take a while to pull all the online databases):

> bioflow downloaddbs
> bioflow loadneo4j

Set the set of perturbed proteins on which we want to base our analysis:

> bioflow mapsource /your/path/here.tsv --background=/your/other_path/here.tsv

Rebuild the laplacians:

> bioflow rebuildlaplacians

Perform the analysis:

> bioflow analyze --matrix interactome --depth 24 --processors 3 --background True \
                --name=<name_of_experiment>

> bioflow analyze --matrix annotome --depth 24 --processors 3 --background True \
                --name=<name_of_experiment>

Alternatively:

> bioflow analyze --depth 24 --processors 3 --background True --name=<name_of_experiment>

More information is available with:

> bioflow --help

> bioflow about

The results of analysis will be available in the output folder, and printed out to the StdOut.

Post-processing:

The .gdf file format is one of the standard formats for graph exchange. For each node, it contains the following columns (a small parsing sketch follows the list):

  • node ID
  • information current passing through the node
  • node type
  • legacy_id
  • degree of the node
  • whether it is present or not in the hits list (source)
  • p-value, comparing the information flow through the node to the flow expected for a random set of genes
  • -log10(p_value) (p_p-value)
  • rel_value (information flow relative to the flow expected for a random set of genes)
  • std_diff (how many standard deviations above the flow for a random set of genes the flow from the hits list is; not a robust metric)
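
If you prefer programmatic post-processing to Gephi, here is a minimal sketch of reading the node section of the .gdf file; it only assumes the standard GDF layout (a nodedef> header line, node rows, then an edgedef> section), with the node attributes being whichever columns are listed above:

> import csv
> def read_gdf_nodes(gdf_path):
>     """Parse the node section of a .gdf file into a list of dicts."""
>     nodes, header = [], None
>     with open(gdf_path) as gdf_file:
>         for row in csv.reader(gdf_file):
>             if not row:
>                 continue
>             if row[0].startswith('nodedef>'):
>                 # 'nodedef>name VARCHAR,current DOUBLE,...' -> ['name', 'current', ...]
>                 header = [row[0].split('>', 1)[1].split()[0]] + \
>                          [col.strip().split()[0] for col in row[1:]]
>             elif row[0].startswith('edgedef>'):
>                 break  # the edge section starts, the node section is over
>             elif header is not None:
>                 nodes.append(dict(zip(header, row)))
>     return nodes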

The most common pipeline involves using the Gephi open graph visualization platform:

  • Load the .gdf file into Gephi
  • Filter out all the nodes with information flow below 0.05 (Filters > Attributes > Range > current)
  • Perform clustering (Statistics > Modularity > Randomize & use weights)
  • Filter out all the nodes below a significance threshold (Filters > Attributes > Range > p-value)
  • Color nodes based on the Modularity Class (Nodes > Colors > Partition > Modularity Class)
  • Set node size based on p_p-value (Nodes > Size > Ranking > p_p-value)
  • Set text color based on whether the node is in the hits list (Nodes > Text Color > Partition > source)
  • Set text size based on p_p-value (Nodes > Text Size > Ranking > p_p-value)
  • Show the labels (T on the bottom left)
  • Set labels to the legacy IDs (Notepad on the bottom)
  • Perform a ForceAtlas node separation (Layout > Force Atlas 2 > Dissuade Hubs & Prevent Overlap)
  • Adjust label size
  • Adjust label positions (Layout > Label Adjust)

For more details or usage as a library, refer to the usage guide.

Usage Guides:

Data and databases setup:

Assembling the files required for the database creation:

In order to build the main knowledge repository, BioFlow will go and look for the following data repositories specified in the $BIOFLOWHOME/configs/main_configs.yaml file:

  • OBO 1.2 file of GO terms and relations,

    • downloaded from: here
    • will download/look for the go.obo file at $DB_HOME$/GO/
  • UNIPROT-SWISSPROT .txt text database file

    • downloaded from here
    • will store/look for the master tab file at $DB_HOME$/Uniprot/uniprot_sprot.dat
    • will load the information specified by the NCBI tax id for the organism currently loaded
  • Reactome.org “Events in the BioPax level 3” file

    • downloaded from here
    • will store/look for the files relevant to the organism at $DB_HOME$/Reactome/
  • HiNT PPI binary interaction files for the organisms of interest

    • downloaded from here, here and here
    • will store/look for the files SaccharomycesCerevisiaeS288C_binary_hq.txt, HomoSapiens_binary_hq.txt, MusMusculus_binary_hq.txt at the path $DB_HOME$/HiNT/
  • BioGRID ALL_ORGANISMS PPI file in the tab2 format

    • downloaded from here
    • will store/look for the files at $DB_HOME$/BioGRID/
  • TRRUST literature-curated TF-target interaction files in tsv format

    • downloaded from here and here
    • will store/look for the files at $DB_HOME$/TFs/TRRUST
  • IntAct ComplexPortal tsv files

    • downloaded from here and here
    • will store/look for the files at $DB_HOME$/ComplexPortal
  • PhosphoSite protein kinase-substrate tsv files

    • downloaded from here
    • will store/look for the files at $DB_HOME$/PhosphoSite

It is possible to specify the file locations and identifiers manually, and then download and install them. This is to be used when the download locations for the files move.

Similarly, the configs file also controls the organism selection. Three organisms have provided configurations (human, mouse, S. cerevisiae). Using the same pattern, other organisms can be configured, although the lack of data can be a problem (this is already the case for mouse - we recommend mapping the genes to their human orthologs when mouse is used as a model organism).

Warning

While BioFlow provides an interface to download the databases programmatically, the databases are subject to licenses and terms that it is up to the end users to respect.

Adding new data to the main knowledge repository:

The easiest way to add new information to the main knowledge repository is to find the nodes to which the new knowledge will attach (the convert_to_internal_ids function from the bioflow.neo4j_db.db_io_routines module resolves a lot of x-ref identifiers for physical entity nodes), and then proceed to add new relationships and nodes using DatabaseGraph.link to add a connection between nodes and DatabaseGraph.create to add a new node. DatabaseGraph.attach_annotation_tag can be used to attach annotation tags to new nodes so that they can be searched from the outside. All functions can be batched (cf. API documentation).

A new link will have a call signature of the form link(node_id, node_id, link_type, {param: val}), where the node_ids are internal database ids for the nodes provided by the convert_to_internal_ids function, and link_type is a link type that would be handy for you to remember (preferably in snake_case). Two parameters are expected: source and parse_type. parse_type can only take a value in ['physical_entity_molecular_interaction', 'identity', 'refines', 'annotates', 'annotation_relationship', 'xref'], with 'xref' being reserved for the annotation linking.

A new node will have a call signature of the form create(node_type, {param: val}) and returns the internal id of the created node. node_type is a node type that would be handy for you to remember (preferably in snake_case). Four parameters are expected: 'parse_type', 'source', 'legacyID' and 'displayName'. 'parse_type' can only take a value in ['physical_entity', 'annotation', 'xref'], with 'xref' being reserved for the annotation linking. legacyID is the identifier of the node in the source database and displayName is the name of the biological knowledge node that will be shown to the end user.
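
Here is a minimal sketch of the node and link creation described above. It only uses the call signatures given in the two previous paragraphs and assumes that DatabaseGraph is the instantiated graph interface exposed by bioflow.neo4j_db.GraphDeclarator; all node names, types and property values are purely illustrative. To attach new knowledge to existing nodes, you would first resolve their internal ids with convert_to_internal_ids and use those ids in the link calls; attach_annotation_tag can then make the new nodes searchable.

> # assumption: DatabaseGraph is the instantiated graph interface from bioflow.neo4j_db.GraphDeclarator
> from bioflow.neo4j_db.GraphDeclarator import DatabaseGraph
> # create two new physical entity nodes (signature: create(node_type, {param: val}))
> node_a = DatabaseGraph.create('my_custom_protein',
>                               {'parse_type': 'physical_entity',
>                                'source': 'my_custom_db',
>                                'legacyID': 'CUSTOM_001',
>                                'displayName': 'custom protein A'})
> node_b = DatabaseGraph.create('my_custom_protein',
>                               {'parse_type': 'physical_entity',
>                                'source': 'my_custom_db',
>                                'legacyID': 'CUSTOM_002',
>                                'displayName': 'custom protein B'})
> # link them (signature: link(node_id, node_id, link_type, {param: val}))
> DatabaseGraph.link(node_a, node_b,
>                    'my_custom_interaction',
>                    {'source': 'my_custom_db',
>                     'parse_type': 'physical_entity_molecular_interaction'})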

Main knowledge graph parsing:

Given the difference in the topology and potential differences in the underlying assumptions, we pull the interactome knowledge network (where all nodes map to molecular entities and edges - to physical/chemical interactions between them) and the annotome knowledge network (where some nodes might be concepts used to understand the biological systems - such as ontology terms or pathways) separately.

The parse for the interactome is performed by retrieving all the nodes whose parse_type is physical_entity and all the edges whose parse_type is physical_entity_molecular_interaction, identity or refines. The giant component of the interactome is then extracted and two graph matrices - adjacency and laplacian - are built for it. Weights between the nodes are set in an additive manner according to the policy supplied as the argument to the InteractomeInterface.full_rebuild function or, in case a more granular approach is needed, to the InteractomeInterface.create_val_matrix function. By default, the active_default_<adj/lapl>_weighting_policy functions from the bioflow.algorithms_bank.weigting_policies module are used. The resulting matrices are stored in the InteractomeInterface.adjacency_matrix and InteractomeInterface.laplacian_matrix instance variables, whereas the maps between the matrix indexes and node ids are stored in the .neo4j_id_2_matrix_index and .matrix_index_2_neo4j_id variables.

The parse for the annotome is performed in the same way, but matching the parse_type for nodes to physical_entity and annotation. In the case of a proper graph build, this will result in only the edges of types annotates and annotation_relationship being pulled. Weighting functions are used in a similar manner, as is the mappings storage.

Custom weighting function:

In order to account for different possible considerations when deciding which nodes and connections are more likely to be included in hypothesis generation, we provide the possibility for the end user to supply their own weighting functions for the interactome and the annotome.

The provided functions are stored in the bioflow.algorithms_bank.weigting_policies module. The expected signature of a function is (starting_node, ending_node, edge) -> float, where starting_node and ending_node are of the <neo4j-driver>.Node type, whereas edge is of the <neo4j-driver>.Edge type. Any properties stored in the main knowledge repository (neo4j database) are available as dict-like properties of the node/edge objects (<starting/ending>_node['<property>'] / edge['<property>']).

The functions are to be provided to the bioflow.molecular_network.InteractomeInterface.InteractomeInterface.create_val_matrix() method as the <adj/lapl>_weight_policy_function arguments, for the adjacency and laplacian matrices respectively.
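
A minimal sketch of such a weighting policy, following the (starting_node, ending_node, edge) -> float signature described above; the 'confidence' edge property is purely illustrative and may not exist in your graph:

> def confidence_weighting_policy(starting_node, ending_node, edge):
>     """Weight an edge by an (illustrative) 'confidence' property, defaulting to 0.5."""
>     # node and edge properties are available dict-style, as described above
>     try:
>         return float(edge['confidence'])
>     except (KeyError, TypeError, ValueError):
>         return 0.5
> # passed to InteractomeInterface.create_val_matrix() as the adj_weight_policy_function
> # and/or lapl_weight_policy_function arguments (cf. the paragraph above)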

Custom flow calculation function:

In case a specific algorithm to generate the pairs of nodes between which to calculate the information flow is needed, it can be assigned to InteractomeInterface._flow_calculation_method. Its call signature should conform to the (list, list, int) -> list signature, where the returned list is a list of pairs of (node_idx, weight) tuples. By default, the general_flow method from bioflow.algorithms_bank.flow_calculation_methods is used. It will try to match the expected flow calculation method based on the parameters provided (connex within a set if the secondary set is empty/None, star-like if the secondary set has only one element, bipartite if the secondary set has more than one element).

Similarly, methods to evaluate the number of operations and to reduce their number to a maximum ceiling with the optional int argument sparse_rounds need to be assigned to InteractomeInterface._ops_evaluation_method and InteractomeInterface._ops_reduction_method. By default, they are evaluate_ops and reduce_ops from bioflow.algorithms_bank.flow_calculation_methods.
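
A minimal sketch of a custom flow calculation method following the (list, list, int) -> list-of-pairs signature described above; it simply pairs every primary node with every secondary node and ignores sparse_rounds (the inputs are assumed to already be lists of (node_idx, weight) tuples):

> from itertools import product
> def all_pairs_flow(sample, secondary_sample, sparse_rounds=-1):
>     """Pair every primary node with every secondary node; sparse_rounds is ignored."""
>     if not secondary_sample:
>         # connex within the primary set: pair distinct primary nodes
>         return [(a, b) for a, b in product(sample, sample) if a != b]
>     return list(product(sample, secondary_sample))
> # assignment point described above (a sketch, not the default behaviour):
> # InteractomeInterface._flow_calculation_method = all_pairs_flow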

Custom random set sampling strategy:

In case a custom algorithm for the generation of the background sample needs to be implemented, it should be supplied to the InteractomeInterface.randomly_sample method as the sampling_policy argument.

It is expected to accept an example of the sample and the secondary sample to match, the background from which to sample, the number of samples desired and, finally, a single string parameter modifying the way it works (supplied by the sampling_policy_options parameter of the InteractomeInterface.randomly_sample method).

By default, this function is implemented by the matched_sampling function in the bioflow.algorithms_bank.sampling_policies module.
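
A minimal sketch of a custom sampling policy; the argument order mirrors the matched_sampling function shown in the API documentation below, and whether the policy should yield or return its samples is an assumption to check against that function. It ignores weights and simply draws unweighted random samples of matching sizes from the background:

> import random
> def uniform_sampling_policy(sample, secondary_sample, background, samples,
>                             sampling_policy_options=''):
>     """Yield unweighted random samples matching the sizes of the provided sets."""
>     # background is assumed to be a flat list of node ids; weights, if any, are ignored
>     for _ in range(samples):
>         primary = random.sample(background, len(sample))
>         secondary = (random.sample(background, len(secondary_sample))
>                      if secondary_sample else [])
>         yield primary, secondary
> # supplied to InteractomeInterface.randomly_sample as the sampling_policy argument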

Custom significance evaluation:

By default, the auto_analyze functions for the interactome and the annotome will use the default compare_to_blank functions and seek to determine the significance of flow by comparing the flow achieved by nodes of a given degree in the real sample to that in the random "mock" samples. The comparison is performed using a Gumbel_r distribution fitted to the highest flows achieved by the "mock" runs.

As of now, to change the mode of statistical significance evaluation, a user will need to re-implement the compare_to_blank functions and monkey-patch them in the modules containing the auto_analyze functions.
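
A minimal sketch of such a monkey-patch; my_compare_to_blank is a hypothetical replacement and must reproduce the call signature and return structure of the original compare_to_blank function (to be checked in the module source):

> import bioflow.molecular_network.interactome_analysis as interactome_analysis_module
> def my_compare_to_blank(*args, **kwargs):
>     """Hypothetical replacement significance evaluation; must mirror the original
>     compare_to_blank signature and return structure."""
>     raise NotImplementedError  # implement your own statistics here
> # monkey-patch before calling auto_analyze, as described above
> interactome_analysis_module.compare_to_blank = my_compare_to_blank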

API documentation:

bioflow package

Subpackages

bioflow.algorithms_bank package

Submodules
bioflow.algorithms_bank.clustering_routines module
bioflow.algorithms_bank.conduction_routines module
bioflow.algorithms_bank.flow_calculation_methods module

These methods are responsible for generation of pairs of nodes for which we will be calculating and summing the flow.

bioflow.algorithms_bank.flow_calculation_methods.evaluate_ops(prim_len: int, sec_len: int, sparse_rounds: int = -1) → float[source]

Evaluates the number of total node pair flow computations needed to calculate the complete flow in the sample according to the general_flow policy.

Parameters:
  • prim_len – length of the primary set
  • sec_len – length of the secondary set
  • sparse_rounds – sparse rounds.
Returns:

bioflow.algorithms_bank.flow_calculation_methods.general_flow(sample: Union[List[int], List[Tuple[int, float]]], secondary_sample: Union[List[int], List[Tuple[int, float]], None] = None, sparse_rounds: int = -1) → List[Tuple[Tuple[int, float], Tuple[int, float]]][source]

Performs the information flow computation best matching the provided parameters.

Parameters:
  • sample – primary sample of nodes
  • secondary_sample – secondary sample of nodes
  • sparse_rounds – sparse rounds, in case samples are too big
Returns:

bioflow.algorithms_bank.flow_calculation_methods.reduce_and_deduplicate_sample(sample: Union[List[int], List[Tuple[int, float]]]) → List[Tuple[int, float]][source]

Deduplicates the nodes found in the sample by adding weights of duplicated nodes. In case a list of node ids only is provided, transforms them into a weighted list with all weights set to 1.

Parameters:sample – sample to deduplicate and/or add weights to
Returns:
bioflow.algorithms_bank.flow_calculation_methods.reduce_ops(prim_len, sec_len, max_ops) → int[source]

Determines the sparse_rounds parameter that needs to be used in order to maintain the total number of pairs of nodes needed to calculate the complete flow in the sample according to the general_flow_policy.

Parameters:
  • prim_len – length of the primary set
  • sec_len – length of the secondary set
  • max_ops – maximum allowed number of node pairs
Returns:

bioflow.algorithms_bank.flow_significance_evaluation module
bioflow.algorithms_bank.flow_significance_evaluation.get_neighboring_degrees()[source]

Recovers the maximum flow achieved by nodes of a given degree for each run. In case the user requests it with the nearest_degrees or min_nodes parameters, also recovers the maximum flow values for nodes of similar degrees, or looks for flow values in the nearest degrees until at least min_nodes are found.

Parameters:
  • degree – degree of the nodes
  • max_array – maximum nodes for a given degree in each run
  • nearest_degrees – the minimum number of nearest degrees to look for
  • min_nodes – the minimum number of nodes until which to look for neighbours
Returns:

bioflow.algorithms_bank.flow_significance_evaluation.get_p_val_by_gumbel()[source]

Recovers the statistical significance (p-value equivalent) by performing a gumbel test

Parameters:
  • entry – the values achieved in the real hits information flow computation
  • max_set_red – background set of maximum values achieved during blank sparse_sampling runs
Returns:

bioflow.algorithms_bank.sampling_policies module

This module defines the policies that will be used in order to sample the information flow patterns to compare with.

The general approach is a function that takes in any eventual parameters and outputs a list of pairs of DB_Ids for which the flow will be calculated.

bioflow.algorithms_bank.sampling_policies.characterize_flow_parameters(sample: Union[List[int], List[Tuple[int, float]]], secondary_sample: Union[List[int], List[Tuple[int, float]], None], sparse_rounds: int)[source]

Characterizes the primary and secondary sets and computes their hash, which can be used to match similar samples for random sampling.

Parameters:
  • sample – primary set
  • secondary_sample – secondary set
  • sparse_rounds – if sparse rounds are to be performed
Returns:

first set length, shape, hist, second set length, shape, hist, sparse rounds, hash

bioflow.algorithms_bank.sampling_policies.matched_sample_distribution()[source]

Tries to guess a distribution of floats and sample from it. Uses np.histogram with the number of bins equal to the granularity parameter. For each sample, selects which bin to sample and then picks from the bin a float according to a uniform distribution. If logmode is enabled, the histogram will be in log-space, as well as the sampling.

Parameters:
  • floats_arr – array of floats for which to match the distribution
  • samples_no – number of random samples to retrieve
  • granularity – granularity at which to operate
  • logmode – if sample in log-space
Returns:

samples drawn from the empirically matched distribution

bioflow.algorithms_bank.sampling_policies.matched_sampling(sample, secondary_sample, background, samples, float_sampling_method='exact')[source]

The general random sampling strategy: samples sets of the same size and shape as the primary and secondary sample sets and, if they are weighted, tries to match the random sample weights to the weight distribution of the provided samples.

Parameters:
  • sample – primary sample set
  • secondary_sample – secondary sample_set
  • background – background of ids (and potentially weights) from which to sample
  • samples – random samples wanted
  • sampling_mode – exact/distro/logdistro; the sampling parametrization method, ingesting all the parameters in a single string argument in the general case. Here, a pass-through parameter for the _sample_floats function if samples are weighted and the distribution of weights is being matched.
Returns:

bioflow.algorithms_bank.weigting_policies module
Module contents

bioflow.annotation_network package

Submodules
bioflow.annotation_network.BioKnowledgeInterface module
bioflow.annotation_network.additional module
bioflow.annotation_network.knowledge_access_analysis module
Module contents

bioflow.bio_db_parsers package

Submodules
bioflow.bio_db_parsers.ComplexPortalParser module
bioflow.bio_db_parsers.ComplexPortalParser.parse_complex_portal(complex_portal_file)[source]
bioflow.bio_db_parsers.PhosphositeParser module
bioflow.bio_db_parsers.PhosphositeParser.parse_phosphosite(phoshosite_file, organism)[source]

Parses the PhosphoSite tsv file

Parameters:
  • phoshosite_file
  • organism
Returns:

bioflow.bio_db_parsers.geneOntologyParser module

Contains the functions responsible for the parsing of the GO terms

class bioflow.bio_db_parsers.geneOntologyParser.GOTermsParser[source]

Bases: object

Wrapper object for a parser of GO terms.

flush_block()[source]

flushes all temporary term stores to the main data stores

parse_go_terms(source_file_path)[source]

Takes the path to the gene ontology .obo file and returns the parse results as a dict and a list

Parameters:source_file_path – gene ontology .obo file
Returns:dict containing term parse, list containing inter-term relationships (turtle triplets)

parse_line_in_block(header, payload)[source]

Parses a line within GO term parameters block

Parameters:
  • header – GO term parameter name
  • payload – GO term parameter value
start_block()[source]

resets temporary stores to fill so that a new term can be loaded

bioflow.bio_db_parsers.proteinRelParsers module

Protein relationships parser

bioflow.bio_db_parsers.proteinRelParsers.parse_bio_grid(bio_grid)[source]

Parses the given file as a BioGrid file and returns the parse results

Parameters:bio_grid – the location of the BioGrid file that needs to be parsed
Returns:
bioflow.bio_db_parsers.proteinRelParsers.parse_hint(_hint_csv)[source]

Reads protein-protein relationships from a HiNT database file

Parameters:_hint_csv – location of the HiNT database tsv file
Returns:{UP_Identifier:[UP_ID1, UP_ID2, …]}
bioflow.bio_db_parsers.reactomeParser module
bioflow.bio_db_parsers.tfParsers module
bioflow.bio_db_parsers.uniprotParser module

The module responsible for parsing of the Uniprot SWISSPROT .dat file for a subset of cross-references that are useful in our database.

Once uniprot is parsed, it is returned as the dictionary containing the following elements:

Uniprot = { SWISSPROT_ID: {
    'Acnum': [],
    'Names': {'Full': '', 'AltNames': []},
    'GeneRefs': {'Names': [], 'OrderedLocusNames': [], 'ORFNames': []},
    'TaxID': '',
    'Ensembl': [], 'KEGG': [], 'EMBL': [], 'GO': [], 'Pfam': [],
    'SUPFAM': [], 'PDB': [], 'GeneID': [],
}}
class bioflow.bio_db_parsers.uniprotParser.UniProtParser(tax_ids_to_parse)[source]

Bases: object

Wraps the Uniprot parser

end_block()[source]

Manages the behavior of the end of a parse block

Returns:
get_access_dicts()[source]

Returns an access dictionary that maps gene names, AcNums or EMBL identifiers to the SwissProt IDs

Returns:dictionary mapping all the external database identifiers towards uniprot IDs
parse_gene_references(line)[source]

Parses gene names and references from the UNIPROT text file

Parameters:line
parse_name(line)[source]

Parses a line that contains a name associated to the entry we are trying to load

Parameters:line
Returns:
parse_uniprot(source_path)[source]

Performs the entire uniprot file parsing and importing

Parameters:source_path – path towards the uniprot text file
Returns:uniprot parse dictionary
parse_xref(line)[source]

Parses an xref line from the Uniprot text file and updates the provided dictionary with the results of parsing

Parameters:line
process_line(line, keyword)[source]

A function that processes a line parsed from the UNIPROT database file

Parameters:
  • line
  • keyword
Module contents

bioflow.db_importers package

Submodules
bioflow.db_importers.biogrid_importer module
bioflow.db_importers.complex_importer module
bioflow.db_importers.go_and_uniprot_importer module
bioflow.db_importers.hint_importer module
bioflow.db_importers.import_main module
bioflow.db_importers.phosphosite_importer module
bioflow.db_importers.reactome_importer module
bioflow.db_importers.tf_importers module
Module contents

This module manages the import of database connections into the Neo4j instance. In case the weights are presented as ID_1, ID_2, weight, the insertion can be done by adapting the “insert into the database” method from almost any database insertion.

Parsing and insertion of the files in the BioPax format is more complex - please refer to the Reactome parsing and inserting modules

bioflow.molecular_network package

Submodules
bioflow.molecular_network.InteractomeInterface module
bioflow.molecular_network.interactome_analysis module
Module contents

bioflow.neo4j_db package

Submodules
bioflow.neo4j_db.GraphDeclarator module
bioflow.neo4j_db.cypher_drivers module
bioflow.neo4j_db.db_io_routines module
Module contents

Methods to interface and interact with the database of biological knowledge used as a backbone

bioflow.pre_processing package

Submodules
bioflow.pre_processing.remap_IDs module

Remaps the IDs of the gene identifiers to a different organism before processing them

This is mainly interesting for applications where there is little information about the genetic networks specific to the organism in question, which is used as a model for another organism (e.g. mouse vs human), and we want to project the genes associated with the phenotype in the model organism onto the networks associated with the other organism.

bioflow.pre_processing.remap_IDs.translate_identifiers(data_source_location, data_dump_location, translation_file_location, gene_to_id_file_location)[source]

Performs a translation of gene identifiers from one organism to another one.

Parameters:
  • data_source_location – where the gene id to translate are
  • data_dump_location – where the translated gene ids will be stored
  • translation_file_location – where the file that contains gene id mappings is located. expects a .tsv file with [Source org gene id, dest org gene id, dest org gene symbol, confidence (1=high, 0=low), …] per line
  • gene_to_id_file_location – (optional) where the file that maps gene ids to HGNC names is located; expects a .tsv file with [Gene stable ID, Transcript stable ID, Gene name, HGNC symbol]
Returns:

bioflow.pre_processing.rna_counts_analysis module
Module contents

bioflow.utils package

Subpackages
bioflow.utils.general_utils package
Submodules
bioflow.utils.general_utils.debug_scripts module
bioflow.utils.general_utils.deprecated_id_translations module
bioflow.utils.general_utils.dict_like_configs_parser module
bioflow.utils.general_utils.high_level_os_io module

Saner io and filesystem manipulation compared to python defaults

bioflow.utils.general_utils.high_level_os_io.copy_if_doesnt_exist(source, destination)[source]

Copies a file if it does not exist

Parameters:
  • source
  • destination
Returns:

bioflow.utils.general_utils.high_level_os_io.mkdir_recursive(path)[source]

Recursively creates a directory that would contain a file given win-like filename (xxx.xxx) or directory name :param path: :return:

bioflow.utils.general_utils.high_level_os_io.wipe_dir(path)[source]

wipes the indicated directory :param path: :return: True on success

bioflow.utils.general_utils.id_translations module
bioflow.utils.general_utils.internet_io module

Module responsible for the retrieval of files stored on the internet. Requires some refactoring before it can be considered a library in its own right.

bioflow.utils.general_utils.internet_io.check_hash(file_path, expected_hash, hasher)[source]

Checks a expected_hash of a file :param file_path: :param expected_hash: :param hash_type: :return:

bioflow.utils.general_utils.internet_io.marbach_post_proc(local_directory)[source]

Function to post-process a specific online database that is rather unfriendly

Returns:
bioflow.utils.general_utils.internet_io.url_to_local(url, path, rename=None)[source]

Copies a file from an http or ftp url to a local destination provided in path, while choosing a good decompression algorithm. So far, only gunzipped ftp url downloads are supported, and path autocompletion is available only for non-compressed files.

Parameters:
  • url
  • path
Raises:

Exception – renaming for gunzip and zipped files is not supported

Returns:

bioflow.utils.general_utils.internet_io.url_to_local_p_gz(url, path)[source]

Copies a file from an http or ftp url to a local destination provided in path :param url: :param path: :raise Exception: something is wrong with the uri :return:

bioflow.utils.general_utils.internet_io.url_to_local_p_zip(url, path)[source]

Copies a file from an http url to a local folder provided in path :param url: :param path: :raise Exception: something is wrong with the path :raise Exception: something is wrong with the uri :return:

bioflow.utils.general_utils.internet_io.url_to_local_path(url, path, rename=None)[source]

Copies a file from an http url to a local destination provided in path. Performs file-to-folder conversion :param url: :param path: :raise Exception: something is wrong with the uri :return:

bioflow.utils.general_utils.useful_wrappers module

Contains a set of wrappers that are useful for debugging and operation profiling

bioflow.utils.general_utils.useful_wrappers.debug_wrapper(function_of_interest)[source]

Convenient wrapper inspecting the results of a function returning a single matrix

Parameters:function_of_interest
Returns:wrapped functions with copied name and documentation
bioflow.utils.general_utils.useful_wrappers.my_timer(message='', previous_time=[])[source]

A small timer to be used in code to measure the execution duration of elements of code

Parameters:
  • message
  • previous_time
Returns:

bioflow.utils.general_utils.useful_wrappers.render_2d_matrix(matrix, name)[source]

Subroutine required by the rendering wrapper.

Parameters:
  • matrix
  • name
Returns:

bioflow.utils.general_utils.useful_wrappers.time_it_wrapper(function_of_interest)[source]

Convenient wrapper for timing the execution time of functions :param function_of_interest: :return: wrapped functions with copied name and documentation

Module contents
Submodules
bioflow.utils.dataviz module
bioflow.utils.gdfExportInterface module
bioflow.utils.io_routines module
bioflow.utils.linalg_routines module
bioflow.utils.log_behavior module

File managing most of the logging behavior of the application.

bioflow.utils.log_behavior.add_to_file_handler(my_logger, level, file_name, rotating=False, log_location='/home/docs/bioflow/.internal/logs')[source]

Adds a file-writing handler for the log.

Parameters:
  • my_logger – the logger to which add a handler
  • level – logging.DEBUG or other level
  • file_name – short file name, that will be stored within the application logs location
  • rotating – if true, rotating file handler will be added.
Returns:

bioflow.utils.log_behavior.clear_logs()[source]

Wipes all logs

bioflow.utils.log_behavior.get_logger(logger_name)[source]

Returns a properly configured logger object

Parameters:logger_name – name of the logger object
bioflow.utils.log_behavior.mkdir_recursive(my_path)[source]

Copy of mkdir recursive from saner configs, used here to remove circular dependencies Recursively creates a directory that would contain a file given a windows-like filename (xxx.xxx) or a directory name

Parameters:my_path – path which to recursively create
Returns:None
bioflow.utils.log_behavior.wipe_dir(my_path)[source]

wipes the indicated directory

Parameters:my_path – path to wipe
Returns:True on success
bioflow.utils.random_path_statistics module
bioflow.utils.remap_IDs module
bioflow.utils.smtp_log_behavior module
bioflow.utils.source_dbs_download module
bioflow.utils.top_level module
bioflow.utils.usage_ex-average_path_length module
Module contents

Submodules

bioflow.analysis_pipeline_example module

bioflow.cli module

This module manages the command line interface

bioflow.cli.print_version(ctx, value)[source]

Module contents
