Welcome to the VEOIBD Synapse Data Manager documentation!¶
Project Description¶
Administrative logistics for uploading data to Synapse and annotating it, for members of the VEOIBD consortium.
Before we begin¶
There are a few things that we take for granted at this point.
- You are a registered Synapse user.
- You have either Conda or Miniconda installed.
- You have the ability to execute Makefiles.
- Linux/Unix/OS X should work fine out of the box.
- Windows users will need Cygwin installed, or can use the ability to run bash in Windows 10.
Registering with Synapse¶
- Create an account at Synapse
- You will need to register and become a certified user.
Installing¶
Download this project repository¶
Using git...
git clone https://github.com/ScottSnapperLab/veoibd-synapse-data-manager.git
Using your web browser...
- Click here to see a list of releases.
- Download and unzip.
- Navigate into the project folder.
“Install” the package¶
At your terminal, in the directory we created above:
make install
Now activate the conda environment that we just created with this command:
source activate veoibd_synapse
And let’s see the program’s main help text:
$ veoibd_synapse --help
Usage: veoibd_synapse [OPTIONS] COMMAND [ARGS]...
Command interface to the veoibd-synapse-manager.
For command specific help text, call the specific command followed by the
--help option.
Options:
-c, --config DIRECTORY Path to optional config directory. If `None`,
configs/ is searched for *.yaml files.
--home Print the home directory of the install and exit.
--help Show this message and exit.
Commands:
configs Manage configuration values and files.
push Consume a push-config file, execute described...
Configuring¶
To generate fresh example configs in the config directory, use the veoibd_synapse configs command.
Usage: veoibd_synapse configs [OPTIONS]
Manage configuration values and files.
Options:
-l, --list Print the configuration values that will be
used and exit.
-g, --generate-configs Copy one or more of the 'factory default'
config files to the top-level config
directory. Back ups will be made of any
existing config files. [default: False]
-k, --kind [all|site|users|projects|push|pull]
Which type of config should we replace?
[default: all]
--help Show this message and exit.
- See Usage for more detailed information.
To generate a full set of fresh configs, run this:
veoibd_synapse configs --generate-configs
Uploading a batch of files¶
- You need to have a project created on Synapse for the files to be sent to.
- You will need to have created the appropriate configuration files:
- See the config section
- See Usage for more detailed information.
Let’s take a look at the help text for the push command:
$ veoibd_synapse push --help
Usage: veoibd_synapse push [OPTIONS]
Consume a push-config file, execute described transactions, save record of
transactions.
Options:
-u, --user TEXT Provide the ID for a user listed in the 'users' config
file.
--push-config PATH Path to the file where this specific 'push' is
configured.
--help Show this message and exit.
And a quick example command would be:
$ veoibd_synapse push -u GUSDUNN --push-config configs/GUSDUNN/new_WES_files.yaml
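When several push-config files have to be submitted for the same user, the command above can be assembled programmatically. The following is only a sketch: it builds argument lists from the -u and --push-config options shown in the help text and does not execute anything. The helper name build_push_command and the second config file name are hypothetical.

```python
def build_push_command(user, push_config):
    """Return the argument list for one `veoibd_synapse push` call."""
    # Only options that appear in the help text above are used here.
    return ["veoibd_synapse", "push", "-u", user, "--push-config", str(push_config)]


# Hypothetical batch of push-config files for one user.
push_configs = [
    "configs/GUSDUNN/new_WES_files.yaml",
    "configs/GUSDUNN/more_files.yaml",  # hypothetical file name
]
commands = [build_push_command("GUSDUNN", cfg) for cfg in push_configs]

for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be handed to subprocess.run(cmd, check=True) inside the activated conda environment.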
Usage¶
Setup Your Site’s Information¶
- Coming Soon.
Register New Data¶
This demo assumes that you have already gone through the demo: Setup Your Site’s Information
Generate a new “push” configuration file. Here we will give it the “demo” identifier. The command and its output are shown below.
$ veoibd_synapse configs --generate-configs --kind push --prefix demo
[I 170824 12:15:18 main:41] Setup logging configurations.
[D 170824 12:15:18 main:118] kind = ('push',)
[I 170824 12:15:18 config:32] Generated new config: configs/demo.push.yaml
This gives you a blank config file to describe the data that you want to add to the system including:
- the project name
- annotation keywords to help filter the files on Synapse
- the location of the files you want to add to Synapse
- the location in Synapse where you want the files listed.
Amend the values in configs/demo.push.yaml to reflect the data files that you will add.
I have already created and configured a user named “GUSDUNN” (see how to do this here).
An example command to register the data described in demo.push.yaml with Synapse is shown below.
$ veoibd_synapse push --user GUSDUNN --push-config configs/demo.push.yaml
Sync Local Metadata Database¶
- Coming Soon.
Query Metadata Database¶
Download Data Files¶
- Coming Soon.
A few of the Makefile Commands¶
The Makefile contains the central entry points for common tasks related to this project.
- make github_remote launches a script to create and synchronize a remote GitHub repository for this project.
- make environment uses the project’s requirements.txt file to create and provision a conda environment for this project. Then it registers the conda environment with your local jupyter set up as an ipython kernel with the same name as the conda environment.
- make clean_env undoes everything that make environment sets up.
- make serve_nb starts a jupyter notebook with this project’s notebooks directory as the root.
- make clean_bytecode removes all __pycache__ directories and leftover *.pyc files from the project.
For the full list:
$ make help
Syncing data to S3¶
- make sync_data_to_s3 will use s3cmd to recursively sync files in data/ up to s3://<your-bucket>/data/, where <your-bucket> stands for the optional bucket configured for syncing data (given without the ‘s3://’ prefix).
- make sync_data_from_s3 will use s3cmd to recursively sync files from s3://<your-bucket>/data/ to data/.
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions¶
Report Bugs¶
Report bugs at https://github.com/ScottSnapperLab/veoibd-synapse-data-manager/issues.
If you are reporting a bug, please include:
- Your operating system name and version.
- Any details about your local setup that might be helpful in troubleshooting.
- Detailed steps to reproduce the bug.
Fix Bugs¶
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features¶
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation¶
veoibd-synapse-data-manager could always use more documentation, whether as part of the official veoibd-synapse-data-manager docs, in docstrings, or even on the web in blog posts, articles, and such.
Submit Feedback¶
The best way to send feedback is to file an issue at https://github.com/ScottSnapperLab/veoibd-synapse-data-manager/issues.
If you are proposing a feature:
- Explain in detail how it would work.
- Keep the scope as narrow as possible, to make it easier to implement.
- Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!¶
Ready to contribute? Here’s how to set up veoibd-synapse-data-manager for local development.
Fork the veoibd-synapse-data-manager repo on GitHub.
Clone your fork locally:
$ git clone git@github.com:your_name_here/veoibd-synapse-data-manager.git
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
$ mkvirtualenv veoibd-synapse-data-manager
$ cd veoibd-synapse-data-manager/
$ python setup.py develop
Create a branch for local development:
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 veoibd_synapse tests
$ python setup.py test or py.test
$ tox
To get flake8 and tox, just pip install them into your virtualenv.
Commit your changes and push your branch to GitHub:
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website.
Pull Request Guidelines¶
Before you submit a pull request, check that it meets these guidelines:
- The pull request should include tests.
- If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
- The pull request should work for Python 3.3, 3.4 and 3.5. Check https://travis-ci.org/ScottSnapperLab/veoibd-synapse-data-manager/pull_requests and make sure that the tests pass for all supported Python versions.
Credits¶
Development Lead¶
- Gus Dunn <w.gus.dunn@gmail.com>
Contributors¶
None yet. Why not be the first?
Change Log¶
v0.1.0 / 2017-10-06¶
- Successfully parse R1 file names
- loaders.vcf: fixed imports
- loaders.vcf.add_parsed_info_col: yank needless lambda
- loaders.vcf: added cyvcf2 support and zygosity
- multigene.snake: formatting
- multigene.snake: added DEBUG metarule list
- multigene.snake: rearranged imports
- mulmultigene_LOF_search.snake: changed RUN.globals.input_vcfs
- multigene.snake: added VCF_CHECK
- added logging statements
- added some metadata to multigene pipeline
- set max mem in snpeff rules to 4g
- multigene.snake: altered way config files treated
- config.py: update replace_config
- misc.py: yank DAG stuff, add load_csv/nan_to_str
- errors.py remove logging
- logging.yaml: use top_level_logs to store certain logs
- update issue template
- add vscode to ignore
- switch to logzero
- multigene: switch to ruamel.yaml
- added explicit __all__ lists for imports in __init__.py files
- data.loaders.vcf: defined single identity func
- data.loaders.vcf: formatting and docstrings
- switched to snaketools
- removed extraneous print
- Snakefile: switch to logzero
- Snakefile: switch to ruamel.yaml
- updated requirements for tools in pipeline
- switched to snaketools
- Makefile: corrected install_python
- docs/conf.py is now responsible for sphinx-api call
- amended module docs title to ‘Source Code Documentation’
- allow files from sphinx-apidoc to be version controlled.
- configs cmd now supports prefixes to group yamls
- docs/usage.rst: drafted demo “Register New Data”
- Preliminary switch to logzero logging - currently ignores logging.yaml config vals
- reorganized docs
- updated docs/requirements.txt
- pip install -e . succeeds (hopefully RTD will too)
- setup.py is now pypi-able
- Update for RTD
- Merged pypackage goodies and updated README
- merged Makefile with updated cookiecutter-data-science
- add multiple binary file types to ignored
- committed all from feature/sync-db
v0.0.4 / 2017-03-29¶
- TeamSubjectDatabase(SubjectDatabase) is functional
- ProjectSubjectDatabase(SubjectDatabase) is functional
- added cli.syncdb skeleton
v0.0.3 / 2017-03-20¶
- added rule “SNPSIFT_ANNOTATE”
- added basic logging boilerplate
- removed un-needed instantiation of synapse object in cli.push
- Merge branch ‘feature/CD55-filter’ into develop
- Merge branch ‘feature/multigene-filter’ into develop
- added src/veoibd_synapse/data/preprocessing/variant_tables.py
- Merge branch ‘feature/multigene-filter’ into develop
- added preprocessing package
- reorganized
- Merge branch ‘develop’ into feature/CD55-filter
- removed certain existing xls files from tracking
- Merge branch ‘feature/CD55-filter’ into develop
- ignore xls type files and libreoffice lock files
- Filter is functional.
- Merge branch ‘feature/count-gene-variants’ into develop
- Functional minimal gtf parser
- Added graph drawing rules to snakefile
Source Code Documentation¶
veoibd_synapse package¶
Subpackages¶
veoibd_synapse.cli package¶
Submodules¶
veoibd_synapse.cli.config module¶
Provide functions used in cli.config.
veoibd_synapse.cli.main module¶
Provide command line interface to the synapse manager.
veoibd_synapse.cli.pull module¶
Provide code devoted to downloading data from Synapse.
veoibd_synapse.cli.push module¶
Provide code devoted to uploading data to Synapse.
class veoibd_synapse.cli.push.BaseInteraction(info)
Bases: object
Base class to manage information and execution for a single interaction with Synapse.
class veoibd_synapse.cli.push.Push(main_confs, user, push_config)
Bases: object
Manage interactions with Synapse concerning adding/changing information on the Synapse servers.
- _Push__base_info(): Return a fresh basic info tree for a new interaction to update.
- __init__(main_confs, user, push_config): Initialize and validate basic information for a Push.
- _get_remote_entity_dicts(): Query Synapse for all entity information related to this Project ID.
class veoibd_synapse.cli.push.PushInteraction(info, push_obj)
Bases: veoibd_synapse.cli.push.BaseInteraction
Manage information and execution for a single “push” interaction with Synapse.
veoibd_synapse.cli.query module¶
Provide code devoted to querying Synapse.
veoibd_synapse.cli.syncdb module¶
Provide code devoted to retrieving and building the most up-to-date metadata database info from Synapse.
class veoibd_synapse.cli.syncdb.ProjectSubjectDatabase(main_confs, syn, project_id)
Bases: veoibd_synapse.cli.syncdb.SubjectDatabase
Manage interactions with Synapse concerning accessing, downloading, and combining subject database files from a single Synapse Project.
- __init__(main_confs, syn, project_id): Initialize and validate basic information.
Parameters:
- main_confs (dict-like) – reference to main configuration tree.
- syn (Synapse) – an active synapse connection object.
- project_id (str) – Synapse ID for a project.
class veoibd_synapse.cli.syncdb.SubjectDatabase(main_confs, syn)
Bases: object
Base class for managing interactions with Synapse concerning accessing, downloading, and combining subject database files from member-sites.
class veoibd_synapse.cli.syncdb.TeamSubjectDatabase(main_confs, syn, team_name)
Bases: veoibd_synapse.cli.syncdb.SubjectDatabase
Manage interactions with Synapse concerning accessing, downloading, and combining subject database files from all member-sites in a Team.
- __init__(main_confs, syn, team_name): Initialize and validate basic information.
Parameters:
- main_confs (dict-like) – reference to main configuration tree.
- syn (Synapse) – an active synapse connection object.
- team_name (str) – the name of a Synapse Team.
- build_project_dbs(): Iterate through project IDs building DB tables from each, storing the results. Stored as ProjectSubjectDatabase objects.
Module contents¶
veoibd_synapse.data package¶
Subpackages¶
Provide helper utilities to extract_subids specific to BCH.
veoibd_synapse.data.extract_subids.utils.bch.extract_subject_names(file_names)
Extract subject names from file_names and return subject_names.
veoibd_synapse.data.extract_subids.utils.bch.make_class_masks(subject_names)
Define masks and store in self._class_masks.
veoibd_synapse.data.extract_subids.utils.bch.process_r1(file_names)
Return the extracted and recoded subject_names from file_names.
veoibd_synapse.data.extract_subids.utils.bch.recode_dashed_alphas(subject_names, masks)
Recode appropriate indices of self._subject_names.
veoibd_synapse.data.extract_subids.utils.bch.recode_dashed_dots(subject_names, masks)
Recode appropriate indices of self._subject_names.
veoibd_synapse.data.extract_subids.utils.bch.recode_fam_letters(subject_names, masks)
Recode appropriate indices of self._subject_names.
veoibd_synapse.data.extract_subids.utils.bch.recode_subject_names(subject_names, masks)
Return the fully recoded subject_names.
veoibd_synapse.data.extract_subids.utils.bch.test_dash_in(x)
Return True if x contains a dash.
veoibd_synapse.data.extract_subids.utils.bch.test_ends_letter(x)
Return True if x ends with a letter.
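The two predicates at the end of this list are simple enough to illustrate. The sketch below is an assumption about how such checks could be written in plain Python and how they might feed a boolean mask. The names contains_dash and ends_with_letter are stand-ins for test_dash_in and test_ends_letter, and the subject names are made up.

```python
def contains_dash(x):
    """Return True if x contains a dash (stand-in for test_dash_in)."""
    return "-" in x


def ends_with_letter(x):
    """Return True if x ends with a letter (stand-in for test_ends_letter)."""
    # An empty string has no last character, so it cannot end with a letter.
    return bool(x) and x[-1].isalpha()


# Made-up subject names, used only to show how boolean masks could be built.
subject_names = ["101-A", "202", "303-1", "404B"]
dash_mask = [contains_dash(name) for name in subject_names]
letter_mask = [ends_with_letter(name) for name in subject_names]
```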
Provide code to extract subject ID out of various forms used at BCH.
Provide code to build pyparsing objects that deal with GTF lines.
class veoibd_synapse.data.parsers.GTF.GTFLine(seqname, source, feature, start, end, score, strand, frame, attributes, line_number=None)
Bases: object
Attributes: attributes, end, feature, frame, line_number, score, seqname, source, start, strand
veoibd_synapse.data.parsers.GTF.parse_gtf_file(path)
Parse full GTF file by yielding parsed GTF lines.
Commented text is ignored.
Parameters: path (Path) – Path obj pointing to GTF file.
Yields: GTFLine – representing a parsed GTF line.
veoibd_synapse.data.parsers.GTF.parse_gtf_line(line, line_number=None)
Parse a single line of a GTF file into its columns, converting the attributes into a dict.
Parameters:
- line (str) – One line of GTF formatted information.
- line_number (int|None) – Optional: number of the line this comes from in the file (starting from 1).
Returns: dict-like
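To make the idea of parsing one GTF line concrete, here is a minimal sketch. The module itself is described as building pyparsing objects; this example deliberately swaps in plain string splitting instead, so it is not the package's implementation. The function names, the returned plain dict, and the example line are assumptions for illustration.

```python
GTF_COLUMNS = (
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
)


def parse_attributes(text):
    """Turn 'key "value"; key2 "value2";' into a dict of strings."""
    attributes = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final semicolon
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def parse_gtf_line_sketch(line, line_number=None):
    """Split one tab-separated GTF line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["attributes"] = parse_attributes(record["attributes"])
    record["line_number"] = line_number
    return record


# Illustrative GTF line; all column values are kept as strings.
example = 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1";'
parsed = parse_gtf_line_sketch(example, line_number=1)
```

Splitting the attribute column on semicolons is a simplification: it would break on attribute values that themselves contain a semicolon, which is one reason a grammar-based parser can be the safer choice.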
Submodules¶
veoibd_synapse.data.asset_intake module¶
Code supporting the information discovery and assimilation of data/file assets.
class veoibd_synapse.data.asset_intake.Row(path_hash, file_name, directory, batch_code, file_type, assay_type, bytes, subject_id)
Bases: tuple
- __getnewargs__(): Return self as a plain tuple. Used by copy and pickle.
- static __new__(_cls, path_hash, file_name, directory, batch_code, file_type, assay_type, bytes, subject_id): Create new instance of Row(path_hash, file_name, directory, batch_code, file_type, assay_type, bytes, subject_id)
- __repr__(): Return a nicely formatted representation string
- _asdict(): Return a new OrderedDict which maps field names to their values.
- classmethod _make(iterable, new=<built-in method __new__ of type object at 0xa395c0>, len=<built-in function len>): Make a new Row object from a sequence or iterable
- _replace(_self, **kwds): Return a new Row object replacing specified fields with new values
- assay_type: Alias for field number 5
- batch_code: Alias for field number 3
- bytes: Alias for field number 6
- directory: Alias for field number 2
- file_name: Alias for field number 1
- file_type: Alias for field number 4
- path_hash: Alias for field number 0
- subject_id: Alias for field number 7
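The members listed for Row (a tuple base, _make, _asdict, _replace, and the "Alias for field number" attributes) match what collections.namedtuple generates, so Row appears to be a namedtuple. The sketch below rebuilds an equivalent type under that assumption. All field values are made up.

```python
from collections import namedtuple

# Field order follows the "Alias for field number N" entries above.
Row = namedtuple(
    "Row",
    ["path_hash", "file_name", "directory", "batch_code",
     "file_type", "assay_type", "bytes", "subject_id"],
)

# Made-up values for illustration only.
row = Row(
    path_hash=123456,
    file_name="sample_R1.fastq.gz",
    directory="/data/batch1",
    batch_code="Merck1",
    file_type="FASTQ",
    assay_type="WES",
    bytes=1024,
    subject_id="SUBJ001",
)

updated = row._replace(subject_id="SUBJ002")  # returns a new Row; row is unchanged
as_dict = row._asdict()                       # maps field names to values
```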
veoibd_synapse.data.asset_intake.build_asset_table(asset_conf, pathify=True)
Return asset table as pd.DataFrame built from asset_conf info.
Column Descriptions:
- path_hash (int)
- file_name (str)
- directory (str)
- batch_code (Category): Regeneron1, Merck1, Merck2, etc
- file_type (Category): BAM, VCF, GVCF, FASTQ, etc
- assay_type (Category): WES, WGS, RNAseq, etc
- bytes (int)
- subject_id (str)
Parameters:
- asset_conf (dict-like) – configuration tree built from asset_intake configuration file.
- pathify (bool) – whether or not to run pathify_assets() on the paths in asset_conf
Returns: pd.DataFrame
Module contents¶
veoibd_synapse.rules package¶
Module contents¶
Provide code supporting the running and automating of Snakemake rules.
veoibd_synapse.rules.apply_template(template, keywords)
Return a list of strings of form template with values in keywords inserted.
Parameters:
- template (str) – a string containing keywords ({kw_name}).
- keywords (dict-like) – dict with keys of appropriate keyword names and values as equal length ORDERED lists with the correct values to be inserted.
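One way to read the apply_template description is that the i-th output string is the template filled with the i-th value from every keyword list. The sketch below implements that reading with str.format. It is an assumption about the behavior, not the package's code, and the function name apply_template_sketch and the example template are hypothetical.

```python
def apply_template_sketch(template, keywords):
    """Fill `template` once per position across equal-length keyword lists."""
    names = list(keywords)
    lengths = {len(keywords[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all keyword lists must have the same length")
    # zip(*lists) walks the lists in parallel, so the order of each list matters.
    return [
        template.format(**dict(zip(names, values)))
        for values in zip(*(keywords[name] for name in names))
    ]


paths = apply_template_sketch(
    "results/{sample}/{sample}.{ext}",
    {"sample": ["A", "B"], "ext": ["vcf", "bam"]},
)
```

Note that the values are combined position by position, not as a cross product, which is why the lists must be ordered and of equal length.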
veoibd_synapse.rules.pathify_by_key_ends(dictionary)
Return a dict that has had all values with keys containing the suffixes: ‘_PATH’ or ‘_DIR’ converted to Path() instances.
Parameters: dictionary (dict-like) – Usually the loaded, processed config file as a dict.
Returns: Modified version of the input.
Return type: dict-like
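A rough sketch of the conversion described for pathify_by_key_ends follows. It assumes that keys ending in ‘_PATH’ or ‘_DIR’ hold string values, and that nested dicts should be walked recursively; neither detail is confirmed by the description. Unlike the documented function, which returns a modified version of the input, this sketch returns a new dict. The function name and config keys are hypothetical.

```python
from pathlib import Path


def pathify_by_key_ends_sketch(dictionary):
    """Return a copy with '*_PATH' / '*_DIR' string values turned into Path objects."""
    result = {}
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # Assumption: nested config sections are processed recursively.
            result[key] = pathify_by_key_ends_sketch(value)
        elif isinstance(key, str) and key.endswith(("_PATH", "_DIR")) and isinstance(value, str):
            result[key] = Path(value)
        else:
            result[key] = value
    return result


# Hypothetical config tree.
config = {"OUT_DIR": "results", "THREADS": 4, "SNPEFF": {"JAR_PATH": "tools/snpEff.jar"}}
pathified = pathify_by_key_ends_sketch(config)
```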
class veoibd_synapse.rules.SnakeRun(cfg, snakefile)
Bases: object
Initialize and manage information common to the whole run.
Submodules¶
veoibd_synapse.dag_tools module¶
Provide functions for working with our DAGs.
class veoibd_synapse.dag_tools.ProjectDAG(project_id, synapse_session)
Bases: networkx.classes.digraph.DiGraph
Class to generate and manage our project structure.
class veoibd_synapse.dag_tools.SynNode(entity_dict, synapse_session=None, is_root=False)
Bases: munch.Munch
Provide methods and attributes to model an entity node in a DAG of Synapse Entities.
veoibd_synapse.errors module¶
Provide error classes for veoibd-synapse-data-manager.
exception veoibd_synapse.errors.NoResult
Bases: veoibd_synapse.errors.VEOIBDSynapseError
Raise when an iteration has nothing to return, but normally would.
exception veoibd_synapse.errors.NotImplementedYet(msg=None)
Bases: NotImplementedError, veoibd_synapse.errors.VEOIBDSynapseError
Raise when a section of code that has been left for another time is asked to execute.
exception veoibd_synapse.errors.VEOIBDSynapseError
Bases: Exception
Base error class for veoibd-synapse-data-manager.
exception veoibd_synapse.errors.ValidationError
Bases: veoibd_synapse.errors.VEOIBDSynapseError
Raise when a validation/sanity check comes back with unexpected value.
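The hierarchy above means that any package-specific error can be caught through the shared base class. The sketch below mirrors the documented base classes to show this. The class bodies, the default message in NotImplementedYet, and the helper first_match are assumptions made for illustration.

```python
class VEOIBDSynapseError(Exception):
    """Base error class (mirrors the documented hierarchy)."""


class NoResult(VEOIBDSynapseError):
    """Raised when an iteration has nothing to return, but normally would."""


class NotImplementedYet(NotImplementedError, VEOIBDSynapseError):
    """Raised when deferred code is asked to execute."""

    def __init__(self, msg=None):
        if msg is None:
            msg = "This code has not been implemented yet."  # assumed default text
        super().__init__(msg)


class ValidationError(VEOIBDSynapseError):
    """Raised when a validation/sanity check returns an unexpected value."""


def first_match(items, predicate):
    """Hypothetical helper: return the first matching item or raise NoResult."""
    for item in items:
        if predicate(item):
            return item
    raise NoResult("no item satisfied the predicate")


try:
    first_match([1, 3, 5], lambda n: n % 2 == 0)
except VEOIBDSynapseError as err:
    caught = type(err).__name__  # the base class catches the specific subclass
```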
veoibd_synapse.interface module¶
Provide a representation of the interactions between a Synapse Project and other Synapse Entities.
class veoibd_synapse.interface.VEOProject(name=None, annotations=None, synapse_client=None, config_tree=None, **kwargs)
Bases: object
Manage a collection of Synapse Entities common to a single project.
- __init__(name=None, annotations=None, synapse_client=None, config_tree=None, **kwargs): Initialize an empty ProjectData object.
- _get_project_entity(): Set self.project after retrieving the synapse object by name, create the Project if it does not exist.
veoibd_synapse.misc module¶
Provide misc common functions to the rest of the CLI.
veoibd_synapse.misc.chunk_md5(path, size=1024000)
Calculate and return the md5-hexdigest of a file in chunks of size.
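Hashing a file in chunks keeps memory use bounded for large data files such as BAM or FASTQ. A minimal sketch of the technique with the standard library's hashlib follows. It is not the package's implementation, and the name chunk_md5_sketch and the temporary file are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def chunk_md5_sketch(path, size=1024000):
    """Return the md5 hexdigest of the file at `path`, read `size` bytes at a time."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Write a small temporary file and hash it with a deliberately tiny chunk size.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    tmp_path = tmp.name

digest = chunk_md5_sketch(tmp_path, size=4)
os.remove(tmp_path)
```

The digest is the same regardless of the chunk size, so size only trades memory for the number of read calls.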
Module contents¶
Top-level package for veoibd_synapse.