tmtk - TranSMART data curation toolkit¶
Author: Jochem Bijlard
Source Code: https://github.com/thehyve/tmtk/
Generated: Apr 17, 2018
License: GPLv3
Version: 0.4.4
Philosophy¶
A toolkit for ETL curation for the tranSMART data warehouse for translational research.

The TranSMART curation toolkit (tmtk) aims to provide a language and set of classes for describing data to be uploaded to tranSMART. The toolkit can be used to edit and validate studies prior to loading them with transmart-batch.
Functionality currently available:
- create a transmart-batch ready study from clinical data files.
- load an existing study and validate its contents.
- edit the transmart concept tree in The Arborist graphical editor.
- create chromosomal region annotation files.
- map HGNC gene symbols to corresponding Entrez gene IDs using mygene.info.
Note
tmtk is a python3 package meant to be run in Jupyter notebooks. Results for other setups may vary.
Basic Usage¶
Step 1: Opening a notebook¶
To open a Jupyter Notebook, first open a shell and change directory to where your data is. Then start the notebook server:
cd /path/to/studies/
jupyter notebook
This should open Jupyter's file browser in your browser; create a new notebook here.
Step 2: Using tmtk¶
# First import the toolkit into your environment
import tmtk
# Then create a <tmtk.Study> object by pointing to study.params of a transmart-batch study
study = tmtk.Study('~/studies/a_tm_batch_ready_study/study.params')
# Or, by using the study wizard on a directory with correctly structured clinical data files.
# (Visit the transmart-batch documentation to find out what is expected.)
study = tmtk.wizard.create_study('~/studies/dir_with_some_clinical_data_files/')
Now that we have loaded the study as a tmtk.Study object, some interesting functions are available:
# Check whether transmart-batch will find any issues with the way your study is set up
study.validate_all()
# Graphically manipulate the concept tree in this study by using The Arborist
study.call_boris()
Contents¶
Changelog¶
Version 0.4.4
- Support for Excel templates for 17.1+
- Added data density for the random study generator
- Added package wide options under tmtk.options
- Added builds for Anaconda
- Automated testing on Windows
Version 0.4.2
- Fixed call_boris and related tests and examples.
Version 0.4.1
- Data types and modifiers support in blueprint.
- Fixed issue with empty date columns
- Export studies without including a top node
- Better support for modifiers other than MISSVAL
Version 0.4.0
- Added support to export to skinny format (toolbox.SkinnyExport)
- Support for modifiers and ontology concepts
- known issue: the Arborist does not have full support yet.
- Variable objects are more powerful with more setters.
Version 0.3.5
- Better support for building pipelines from code books using Blueprints
- Set data label, concept path, and word mapping from clinical variable abstraction
- Arborist support for _ and +
- Improved stability of Arborist
- Fixes in Validator for word map file
Version 0.3.3
- More easily extensible validator functionality
- Added multiple validation methods
- Fix issue with namespace cleaner
Version 0.3.1
- Replaced deprecated pandas functionality
- More reliably start batch job
Version 0.3.0
- Create studies from TraIT data templates, see Data templates.
- Create fully randomized studies of any size: tmtk.toolbox.RandomStudy.
- Load data right from Jupyter using transmart-batch, with progress bars! Also works as a command line tool.
- Set name and id from the main study object.
Version 0.2.2
- Minor bug fix for Arborist installation
Version 0.2.1
- The Arborist is now implemented as a Jupyter Notebook extension
- Metadata tags are automatically sorted in Arborist.
Version 0.2.0
- Create and apply tree templates in Arborist
- Improved interaction with metadata tags in Arborist
- Resolved issues with the validator
- R is now an optional dependency
User examples¶
These examples have been extracted from Jupyter Notebooks.
Create study from clinical data¶
tmtk has a wizard that can be used to quickly go from clinical data files to a study object. The main goal of this functionality is to reduce the barrier of setting up all transmart-batch specific files (i.e. parameter files, column mapping and word mapping files).
The way to use it is to call tmtk.wizard.create_study(path), where path points to a directory with clinical data files.
Note: clinical data files have to be in a format that is accepted by transmart-batch.
Here we will create a study from these two files:
import os
files_dir = './studies/wizard/'
os.listdir(files_dir)
['Cell-line_clinical.txt', 'Cell-line_NHTMP.txt']
# Load the toolkit
import tmtk
# Create a study object by running the wizard
study = tmtk.wizard.create_study('./studies/wizard/')
##### Please select your clinical datafiles #####
- 0. /home/vlad-the-impaler/tmtk/studies/wizard/Cell-line_clinical.txt
- 1. /home/vlad-the-impaler/tmtk/studies/wizard/Cell-line_NHTMP.txt
Pick number: 0
Selected files: ['Cell-line_clinical.txt']
Pick number: 1
Selected files: ['Cell-line_clinical.txt', 'Cell-line_NHTMP.txt']
Pick number:
✅ Adding 'Cell-line_clinical.txt' as clinical datafile to study.
✅ Adding 'Cell-line_NHTMP.txt' as clinical datafile to study.
The wizard walked us through some of the options for the study we want to create. Our new study is a public study with STUDY_ID=WIZARD; you can pick an appropriate name by setting study.study_name = 'Ur a wizard harry'.

None of the clinical params have been set, so tmtk will use default names for the column and word mapping files. Next, the datafiles have been loaded and the column mapping object has been created to include the data files.
Next we will run the validator and find out that some files cannot be found. This is expected as these objects are only in memory and not yet on disk.
study.validate_all(5)
⚠ No valid file found on disk for /home/vlad-the-impaler/tmtk/studies/wizard/clinical/word_mapping_file.txt, creating dataframe.
Validating params file at clinical
❌ WORD_MAP_FILE=word_mapping_file.txt cannot be found.
❌ COLUMN_MAP_FILE=column_mapping_file.txt cannot be found.
Detected parameter WORD_MAP_FILE=word_mapping_file.txt.
Detected parameter COLUMN_MAP_FILE=column_mapping_file.txt.
Validating params file at study
Detected parameter TOP_NODE=\Public Studies\You're a wizard Harry\.
Detected parameter STUDY_ID=WIZARD.
Detected parameter SECURITY_REQUIRED=N.
Of course, we want to write our study to disk so it can be loaded with transmart-batch.
study = study.write_to('~/studies/my_new_study')
Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/clinical.params
Writing file to /home/vlad-the-impaler/studies/my_new_study/study.params
Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/column_mapping_file.txt
Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/Cell-line_clinical.txt
Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/word_mapping_file.txt
Writing file to /home/vlad-the-impaler/studies/my_new_study/clinical/Cell-line_NHTMP.txt
Next you can use the TranSMART Arborist to modify the concept tree, or use tmtk to load the study to tranSMART if you've set your $TMBATCH_HOME; see Using transmart-batch from Jupyter.
TranSMART Arborist¶
GUI editor for the concept tree.¶
First load the toolkit.
import tmtk
Create a study object by pointing to a “study.params” file.
study = tmtk.Study('../studies/valid_study/study.params')
To verify that the study object is compatible with transmart-batch for loading, you can run the validator:
study.validate_all()
Validating Tags:
❌ Tags (2) found that cannot map to tree: (1. Cell line characteristics∕1. Cell lines∕Age and 1. Cell line characteristics∕1. Cell lines∕Gender). You might want to call_boris() to fix them.
We will ignore this issue for now as this will be fixed automatically when calling the Arborist GUI.
The GUI allows a user to interactively edit all aspects of TranSMART's concept tree. This includes:
- Concept Paths from the clinical column mapping.
- Word mapping from clinical data files.
- High dimensional paths from subject sample mapping files.
- Metadata tags
# In a Jupyter Notebook, this brings up the interactive concept tree editor.
study.call_boris()
Once returned from The Arborist to the Jupyter environment, we can write the updated files to disk. You can then run transmart-batch on that study to load it into your tranSMART instance.
study.write_to('~/studies/updated_study')
Collaboration with non-technical users¶
Though using Jupyter Notebooks is great for technical users, less technical domain experts might quickly feel discouraged. To allow for collaboration with these users, we will upload this concept tree to a running Boris as a Service webserver. This allows others to make refinements to the concept tree.
study.publish_to_baas('arborist-test-trait.thehyve.net')
Once the study is updated in BaaS, we can update the local files by copying the url for the latest tree into this command.
study.update_from_baas('arborist-test-trait.thehyve.net/trees/valid-study/3/~edit')
Using transmart-batch from Jupyter¶
Using tmtk you can load data to transmart right from Jupyter. For this to work you need to download and build transmart-batch; see the transmart-batch GitHub for instructions.
Once you've done that, you need to set an environment variable to the path of the github repository. The easiest way to do this is to add the following to your ~/.bash_profile:
export TMBATCH_HOME=/home/path/to/transmart-batch
Next, make sure to create a batchdb properties file with an appropriate name in the $TMBATCH_HOME directory. tmtk will look for any *.properties file and allow you to run transmart-batch with that properties file from many objects. Examples of good names are production.properties or test-environment.properties.
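For reference, a minimal sketch of such a properties file, assuming a PostgreSQL-backed tranSMART (the batch.jdbc.* keys follow the transmart-batch documentation; all values below are placeholders to adapt):

# production.properties (placeholder values)
batch.jdbc.driver=org.postgresql.Driver
batch.jdbc.url=jdbc:postgresql://localhost:5432/transmart
batch.jdbc.user=tm_cz
batch.jdbc.password=tm_cz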
Next you will be able to do something like this:
study.load_to.production()
Data formats overview¶
Study folder format¶
When loading a study into tmtk, the folder format below is expected; the same structure is supported by transmart-batch.
File structure¶
study_directory
├── study.params
│
├── clinical
│ ├── clinical.params
│ ├── column_mapping.txt
│ ├── word_mapping.txt
│ ├── modifiers.txt
│ ├── ontology_mapping.txt
│ ├── trial_visits.txt
│ ├── data_file_1.txt
│ ├── ...
│ └── data_file_X.txt
│
├── expression
│ ├── annotation
│ │ ├── mrna_annotation.params
│ │ └── mrna_annotation_file.txt
│ ├── mrna.params
│ ├── subject_sample_mapping.txt
│ └── expression_data_file.txt
│
├── ...
│
├── rnaseq
│ ├── annotation
│ │ ├── rnaseq_annotation.params
│ │ └── rnaseq_annotation_file.txt
│ ├── rnaseq.params
│ ├── subject_sample_mapping.txt
│ └── rnaseq_data_file.txt
│
└── tags
├── tags.params
└── tags.txt
Study parameters¶
A parameter file in which the study-wide parameters are stored, such as the study identifier and whether the study needs to be loaded securely.

- STUDY_ID. Mandatory. Identifier of the study.
- SECURITY_REQUIRED. Default: Y. Defines the study as Private (Y) or Public (N).
- TOP_NODE. The study top node. Has to start with ‘\’ (e.g. ‘\Public Studies\Cell-lines’). Default: ‘\(Public|Private) Studies\<STUDY_ID>’.
Clinical Data¶
Clinical data is meant for all kinds of measurements not falling into other categories. It can be data from questionnaires, physical body measurements, or socio-economic information about the patient.
File structure¶
study_directory
└── clinical
├── clinical.params
├── column_mapping.txt
├── word_mapping.txt
├── modifiers.txt
├── ontologies.txt
├── trial_visits.txt
├── data_file_1.txt
├── ...
└── data_file_X.txt
Parameter file¶
- COLUMN_MAP_FILE. Mandatory. Points to the column mapping file.
- WORD_MAP_FILE. Points to the file with the dictionary to be used.
- MODIFIERS. Points to the modifier file of the study. Only needed when using modifiers.
- ONTOLOGY_MAP_FILE. Points to the ontology mapping for this study. Only needed when using ontologies.
- TRIAL_VISITS_FILE. Points to the trial visits file for this study. Only needed when using trial visits.
File formats¶
Column mapping file¶
A tab-separated file with seven columns:
- Filename. Filename of the data file referring to the data.
- Category Cd. Concept path to be displayed in tranSMART.
- Column number. Column number in the data file.
- Data Label. Data label to display in tranSMART.
- Reference column. Column to which the data from this column refers; used by modifiers. Can be a comma (,) separated list to indicate a range of columns.
- Ontology code. Ontology code from the ontology mapping file.
- Concept Type. Type of the concept; see Allowed values for Concept type for more information.
Example column mapping file:
| Filename | Category Cd | Col Num | Data Label  | Ref Col | Ontology code | Concept Type |
|----------|-------------|---------|-------------|---------|---------------|--------------|
| data.txt | Subjects    | 1       | SUBJ_ID     |         |               |              |
| data.txt | Subjects    | 2       | Age         |         |               | NUMERICAL    |
| data.txt | Subjects    | 3       | Gender      |         |               | CATEGORICAL  |
| data.txt | Subjects    | 4       | Drug        |         |               | CATEGORICAL  |
| data.txt | Subjects    | 5       | MODIFIER    | 4       |               | DOSE         |
| data.txt | Subjects    | 6       | MODIFIER    |         |               | SAMPLE_ID    |
| data.txt |             | 7       | TRIAL_VISIT |         |               |              |
| data.txt |             | 8       | START_DATE  |         |               |              |
Adding modifiers is done by indicating MODIFIER in the Data Label column and indicating a MODIFIER CODE in the Concept Type column. Adding a column number in the Reference column assigns the modifier to the observations from the referenced column. Note that you can indicate multiple references by adding a comma (,) separated list. Leaving the Reference column empty means the modifier will be applied to all columns from that data file.

Trial visits, start and end dates are all applied row-wide and do not require references. The start and end date do expect a set date format (see Reserved keywords). The value entered for a trial visit in the data file should also be defined in the TRIAL_VISIT_FILE with the same label.
Reserved keywords for Data label:
- SUBJ_ID. Needs to be indicated once per data file.
- MODIFIER. Requires a modifier code from the modifier table to be inserted in the Concept Type column.
- TRIAL_VISIT. Values from the data file need to be specified in the TRIAL_VISIT_FILE.
- START_DATE. Required date format: yyyy-mm-dd hh:mm:ss.
- END_DATE. Required date format: yyyy-mm-dd hh:mm:ss.
Allowed values for Concept type:
- NUMERICAL. For numerical data; the default.
- CATEGORICAL. For categorical text values. Can be used to force numerical data to be loaded as categorical.
- DATE. For date values. Expected date format: yyyy-mm-dd hh:mm:ss.
- TEXT. For free text. Observations are stored as a BLOB object and can only be used to select patients who have an observation for this concept.
- MODIFIER CODE. Codes from the modifier table. Any code defined in the modifier table can be inserted to indicate which modifier should be linked.
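As an illustration (plain pandas, not a tmtk API), the example column mapping above could be written as a tab-separated file; the column order is fixed, while the header names are flexible:

import pandas as pd

# Rebuild the example column mapping table; empty strings are empty cells.
rows = [
    ('data.txt', 'Subjects', 1, 'SUBJ_ID', '', '', ''),
    ('data.txt', 'Subjects', 2, 'Age', '', '', 'NUMERICAL'),
    ('data.txt', 'Subjects', 3, 'Gender', '', '', 'CATEGORICAL'),
    ('data.txt', 'Subjects', 4, 'Drug', '', '', 'CATEGORICAL'),
    ('data.txt', 'Subjects', 5, 'MODIFIER', '4', '', 'DOSE'),
    ('data.txt', 'Subjects', 6, 'MODIFIER', '', '', 'SAMPLE_ID'),
    ('data.txt', '', 7, 'TRIAL_VISIT', '', '', ''),
    ('data.txt', '', 8, 'START_DATE', '', '', ''),
]
header = ['Filename', 'Category Cd', 'Col Num', 'Data Label',
          'Ref Col', 'Ontology code', 'Concept Type']
pd.DataFrame(rows, columns=header).to_csv('column_mapping_file.txt', sep='\t', index=False)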
Word mapping file¶
A tab-separated file with four columns:
- Filename. Filename of the data file referring to the data.
- Column number. Column number in the file to which the substitution should be applied.
- From value. Value to be replaced.
- To value. New value.
Example word mapping file:
| Filename | Col Num | From Value | To Value    |
|----------|---------|------------|-------------|
| data.txt | 3       | M          | Male        |
| data.txt | 3       | F          | Female      |
| data.txt | 4       | ASP        | Aspirin     |
| data.txt | 4       | PAC        | Paracetamol |
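As an illustration (plain pandas, not a tmtk API), the same word mapping file could be generated like this:

import pandas as pd

word_map = pd.DataFrame(
    [('data.txt', 3, 'M', 'Male'),
     ('data.txt', 3, 'F', 'Female'),
     ('data.txt', 4, 'ASP', 'Aspirin'),
     ('data.txt', 4, 'PAC', 'Paracetamol')],
    columns=['Filename', 'Col Num', 'From Value', 'To Value'])
word_map.to_csv('word_mapping_file.txt', sep='\t', index=False)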
Trial visit file¶
A tab-separated file with three columns:
- Visit name. Mandatory. Name of the visit, displayed in the tranSMART UI.
- Relative time. Integer indicating the length of time.
- Time unit. Unit of time; possible values: Days, Weeks, Months, Years.

The only mandatory field is the Visit name.
Example trial visit file:
| Visit name    | Relative time | Time unit |
|---------------|---------------|-----------|
| Baseline      | 0             | Months    |
| Treatment     | 3             | Months    |
| Follow up     | 6             | Months    |
| Preoperative  |               |           |
| Postoperative |               |           |
Modifier file¶
A tab-separated file with four columns:
- Modifier path. Path of the modifier.
- Modifier code. Unique modifier code. Used in the column mapping file as Concept type.
- Name character. Label of the modifier.
- Data type. Data type of the modifier; options: CATEGORICAL or NUMERICAL.
| modifier path | modifier code | name char              | data type   |
|---------------|---------------|------------------------|-------------|
| \Dose         | DOSE          | Drug dose administered | NUMERICAL   |
| \Samples      | SAMPLE_ID     | Modifier for Samples   | CATEGORICAL |
Ontology file¶
To be implemented
Clinical Data file(s)¶
The clinical data file contains the low-dimensional observations of each patient. The file name and columns are referenced from the column mapping file. Each data file must contain a column with the patient identifiers.
Note: In the following examples, each variation on the basic structure of clinical data files is shown separately for clarity. However, none of them are mutually exclusive.
Basic structure¶
The basic structure of a clinical data file is patients on the rows and variables on the columns.
| Subject_id | Gender | Treatment arm |
|------------|--------|---------------|
| patient1   | Male   | A             |
| patient2   | Female | B             |
Adding Observation dates¶
When observations are linked to a specific date or time, additional columns for the start date and optionally the end date can be added. All observations present in a row with an observation date will be considered to have that observation date. Start and end dates should be provided in YYYY-MM-DD format and may be accompanied by the time of day in HH:MM:SS format (e.g. 2016-08-23 11:39:00). Please see Column mapping file for information on how to represent this correctly in the column mapping file.
| Subject_id | Start date | End date   | Gender | Treatment arm | BMI  |
|------------|------------|------------|--------|---------------|------|
| patient1   |            |            | Male   | A             |      |
| patient1   | 2016-03-18 | 2016-03-18 |        |               | 22.7 |
| patient2   |            |            | Female | B             |      |
| patient2   | 2016-03-24 | 2016-03-24 |        |               | 20.9 |
Adding Trial visits¶
When one or multiple observations were acquired as part of a clinical trial, they can be mapped as such by adding a Trial visit label column. All observations in a row will be considered part of the same trial visit. The trial visit labels should be defined in the trial visit mapping. See Trial visit file for more information.
| Subject_id | Trial visit label | Gender | Treatment arm | BMI  | Heart rate |
|------------|-------------------|--------|---------------|------|------------|
| patient1   |                   | Male   | A             |      |            |
| patient1   | Baseline          |        |               | 22.7 | 87         |
| patient1   | Week 5            |        |               | 22.6 | 91         |
| patient2   |                   | Female | B             |      |            |
| patient2   | Baseline          |        |               | 20.9 | 82         |
| patient2   | Week 5            |        |               | 20.5 | 82         |
Adding Sample-specific data and Custom modifiers¶
Samples are currently recognized by adding modifiers to your observations.
To indicate samples it is recommended to use the SAMPLE_ID
modifier.
The modifier can be added by adding a column with the sample identifiers to the data file.
When applied, all observations on a row will be linked to the sample identifier.
Next to row-wide modifiers it is also possible to add modifiers for a specific column. These modifiers follow the same rules as the SAMPLE_ID modifier, except that they only apply to observations in the specified columns they are connected to.
For an overview on how to add your own custom modifiers and how to represent these in the column mapping file please see: Modifier file and Column mapping file. Note: The column mapping file determines if a modifier is interpreted as row-wide or column specific, see: Defining modifiers in the column mapping.
Example modifier table, SAMPLE_ID and DOSE are modifiers:
| Subject_id | SAMPLE_ID | Hypermutated | MVD   | Drug        | DOSE |
|------------|-----------|--------------|-------|-------------|------|
| patient1   | GSM210005 | No           | 51.26 | Paracetamol | 50   |
| patient2   | GSM210043 | No           | 27.91 | Aspirin     | 100  |
| patient2   | GSM210047 | Yes          | 77.03 | Paracetamol | 500  |
Metadata tags and description¶
Metadata appears in a popup in the tranSMART concept tree. It can be used to add additional information to your concepts.
File structure¶
study_directory
└── tags
├── tags.params
└── tags.txt
Parameter file¶
The parameters file should be named tags.params and contains:
- TAGS_FILE. Mandatory. Points to the tags file. See below for the format.
File format¶
The metadata files are expected to be flat tab-separated text files with four columns:
- Concept path. Indicates to which concept the metadata belongs. Metadata on the study level is indicated with a ‘\’.
- Tag title. Title of the metadata to be displayed.
- Tag description. Description of the field.
- Weight. Determines the order of the metadata in tranSMART; the higher the number, the lower the tag will appear.
Example input file:
| concept path  | tag title | tag description      | Weight |
|---------------|-----------|----------------------|--------|
| \             | ORGANISM  | Homo Sapiens         | 2      |
| \Subjects\Age | Info      | At time of diagnosis | 3      |
NOTE: The header row is mandatory; the column order is fixed, but the column names are flexible.
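As an illustration (plain pandas, not a tmtk API), the example tags file could be produced like this; the output path assumes a tags/ directory next to study.params:

import pandas as pd

tags = pd.DataFrame(
    [('\\', 'ORGANISM', 'Homo Sapiens', 2),
     ('\\Subjects\\Age', 'Info', 'At time of diagnosis', 3)],
    columns=['concept path', 'tag title', 'tag description', 'Weight'])
tags.to_csv('tags/tags.txt', sep='\t', index=False)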
High dimensional and omics data types¶
High dimensional data parameters¶
- DATA_FILE. Mandatory (alternatively DATA_FILE_PREFIX, but prefer DATA_FILE). Points to the HD data file.
- DATA_FILE_PREFIX. Deprecated because it doesn't behave like a prefix (unlike the original pipeline); use DATA_FILE instead.
- DATA_TYPE. Mandatory; must be R (raw values) or L (log-transformed values).
- LOG_BASE. Default: 2. If present, must be 2. The log base for calculating log values.
- SRC_LOG_BASE. Only to be specified with DATA_TYPE=L. Specifies which logarithm base was used for transforming the data values.
- MAP_FILENAME. Mandatory. Filename of the mapping file.
- ALLOW_MISSING_ANNOTATIONS. Default: N. Y for yes, N for no. Whether the job should be allowed to continue when the data set doesn't provide data for all the annotations (here: probes).
- SKIP_UNMAPPED_DATA. Default: N. If Y, data points that have no subject mapping are ignored. Otherwise (N), such data points produce an error.
- ZERO_MEANS_NO_INFO. Default: N. If Y, rows with raw values equal to 0 are filtered out. Otherwise (N), they are inserted into the database. The flag applies to most HD data types, but does not affect CNV (aCGH) data. For RNA-seq read count data, the check on zeros happens on the normalized read count.
Placeholder for overview.
Backout¶
This job removes data from the database. This job is modular; you can choose what data to delete.
This job will refuse to fully remove a study if the study has data that this job does not support removing. Beware of this limitation, as it limits the usefulness of this job pending the implementation of the remaining modules.
Note that running this job requires a backout.params file, which is not very convenient. You can, however, create an empty backout.params file and specify all the parameters on the command line. E.g.:
touch /tmp/backout.params
./transmart-batch-capsule.jar -p /tmp/backout.params -d STUDY_ID=GSE8581
Available parameters¶
- INCLUDED_TYPES – the modules to include, comma separated. Cannot be specified if EXCLUDED_TYPES is specified. If neither is specified, defaults to all the modules. The full module cannot be explicitly included (the only way to run it is to leave INCLUDED_TYPES and EXCLUDED_TYPES blank).
- EXCLUDED_TYPES – include all the modules except those in this comma-separated list. The module full is automatically excluded if this parameter is not blank. See also INCLUDED_TYPES.
You could also use the study-specific parameters.
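For example, building on the invocation above and assuming the -d flag can be repeated, deleting only the clinical data of a study could look like:

touch /tmp/backout.params
./transmart-batch-capsule.jar -p /tmp/backout.params -d STUDY_ID=GSE8581 -d INCLUDED_TYPES=clinical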
Overview¶
This job will run a few common steps at the beginning and at the end. In the middle, it will run the specified modules sequentially. Each module has two phases: in the first, it determines whether data whose deletion it handles exists in the database; the second phase is only invoked if such data indeed exists, and it handles the data's deletion.
The full module is special. It always runs last, and it aborts the job if it finds concepts or assays belonging to the study in question (apart from the top node). If it doesn't, it proceeds to delete the top node and the study patients. No other module deletes patients, since the data for all of the data types depends on them being present.
Modules¶
Available modules at this point:
- clinical – deletes clinical data and the concepts related only to clinical data. Does not delete patients.
- full – deletes the study top node and the study patients, provided no other data remains.
Loading tranSMART data¶

For all data types¶
- Parameters for all data types
  - [study-params.md](study-params.md) - Parameters that are supported for all data types.
- Metadata
  - [tags.md](tags.md) - Loading study, concept and patient metadata and links to source data per concept.

Low-dimensional data¶
- Clinical data
  - [clinical.md](clinical.md) - Loading numerical and categorical low-dimensional data from clinical, non-high-throughput molecular profiling, derived imaging or biobanking data, or links to source data per patient.
  - [templates.md](templates.md) - Using templates in the clinical data paths.
  - [xtrial.md](xtrial.md) - Uploading across-trial clinical data.

High-dimensional data¶
- General high-dimensional data processing
  - [hd-params.md](hd-params.md) - Parameters that are supported for all high-dimensional data types.
  - [chromosomal_region.md](chromosomal_region.md) - Tabular file structure for loading chromosomal regions.
  - [subject-sample-mapping.md](subject-sample-mapping.md) - Tabular file structure for loading subject sample mappings for HD data.
- mRNA gene expression data
  - [expression.md](expression.md) - Loading microarray gene expression data.
  - under development - Loading read counts and normalized read counts for mRNA-seq and miRNA-seq.
- Copy Number Variation data
  - [cnv.md](cnv.md) - Loading CNV data from Array CGH (comparative genomic hybridisation), SNP Array, DNA-Seq, etc.
- Small Genomic Variants
  - not yet implemented - Loading small genomic variants (SNP, indel in VCF format) from RNA-seq or DNA-seq.
- Proteomics data
  - [proteomics.md](proteomics.md) - Loading protein mass spectrometry data as peptide or protein quantities.
- RNA-Seq data
  - [rnaseq.md](rnaseq.md) - Loading gene-region RNA-Seq data as read counts and normalized read counts.
- Metabolomics data
  - [metabolomics.md](metabolomics.md) - Loading metabolite quantities.
- GWAS data
  - [gwas.md](gwas.md) - Loading Genome Wide Association Study data.

Other¶
- Unloading tranSMART data
  - [backout.md](backout.md) - Deleting data from tranSMART.
- Loading I2B2 data
  - [i2b2.md](i2b2.md) - Loading data to I2B2 with transmart-batch.
API Description¶
Study class¶

class tmtk.Study(study_params_path=None, minimal=False)
Bases: tmtk.utils.validate.ValidateMixin

Describes an entire TranSMART study. This is the main object used in tmtk. Studies can be initialized by pointing to a study.params file. This study has to be structured according to the specification for transmart-batch.

>>> import tmtk
>>> study = tmtk.Study('./studies/valid_study/study.params')

This will create the study object, which can be used as a starting point for custom curation or directly in The Arborist.

To use the more limited 16.2 data model with transmart-batch, set this option before creating the object:

>>> tmtk.options.transmart_batch_mode = True

- all_files: All file objects in this study.
- annotation_files: All annotation file objects in this study.
- apply_blueprint(blueprint, omit_missing=False): Apply a blueprint to the current study.
  Parameters:
  - blueprint – blueprint object (e.g. dictionary) or link to blueprint json on disk.
  - omit_missing – if True, variables that are not present in the blueprint will be set to OMIT.
- call_boris(height=650): Launch The Arborist GUI editor for the concept tree. This starts a Flask webserver in an IFrame when running in a Jupyter Notebook. While The Arborist is opened, the GIL prevents any other actions.
  Parameters: height – set the height of the output cell.
- clinical_files: All clinical file objects in this study.
- concept_tree: ConceptTree object for this study.
- concept_tree_json: Stringified JSON that is used by JSTree in The Arborist.
- concept_tree_to_clipboard(): Send stringified JSON that is used by JSTree in The Arborist to the clipboard.
- ensure_metadata(): Create the Tags object for this study. Does nothing if it is already present.
- find_annotation(platform=None): Search for annotation data within this study and return it.
  Parameters: platform – platform id to look for in this study.
  Returns: an Annotations object or nothing.
- find_params_for_datatype(datatypes=None): Search for parameter files within this study object and return them as a list.
  Parameters: datatypes – single string datatype or list of strings.
  Returns: a list of parameter objects for the specific datatype in this study.
- get_objects(of_type): Search for objects that have inherited from a certain type.
  Parameters: of_type – type to match against.
  Returns: generator for the found objects.
- high_dim_files: All high dimensional file objects in this study.
- load_to
- publish_to_baas(url, study_name=None, username=None): Publishes a tree on a Boris as a Service instance.
  Parameters:
  - url – url to an instance (e.g. http://transmart-arborist.thehyve.nl/).
  - study_name – a nice name.
  - username – if no username is given, you will be prompted for one.
  Returns: the url that points to the study you've just uploaded.
- sample_mapping_files: All subject sample mapping file objects in this study.
- security_required
- study_blob: JSON data that can be loaded in the study blob. This will be added as a separate file next to the study.params. The STUDY_JSON_BLOB parameter will be set to point to this file.
- study_id: The study ID as it is set in study params.
- study_name: The study name, extracted from study param TOP_NODE.
- tag_files
- top_node
- update_from_baas(url, username=None): Give a url to a tree in BaaS.
  Parameters:
  - url – url that has both the study and version of a tree in BaaS (e.g. http://transmart-arborist.thehyve.nl/trees/study-name/1/~edit/).
  - username – if no username is given, you will be prompted for one.
- update_from_treefile(treefile): Give a path to a treefile (from Boris as a Service or otherwise) and update the current study to match the changes made.
  Parameters: treefile – path to a treefile (stringified JSON).
- validate_all(verbosity='WARNING'): Validate all items in this study.
  Parameters: verbosity – only display output of this level and above. Levels: 'debug', 'info', 'okay', 'warning', 'error', 'critical'. Default is 'WARNING'.
  Returns: True if no error or critical is encountered.
- write_to(root_dir, overwrite=False, return_new=True): Write this study to a new directory on the file system.
  Parameters:
  - root_dir – the base directory to write the study to.
  - overwrite – set this to True to overwrite existing files.
  - return_new – if True, load the study object from the new location and return it.
  Returns: new study object if return_new == True.
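Putting the main Study methods together, a typical curation round trip might look like this (a sketch; paths are placeholders):

import tmtk

study = tmtk.Study('./studies/valid_study/study.params')

# Show only errors and critical issues while validating.
study.validate_all(verbosity='error')

# Write the (possibly modified) study to a new location and continue
# working with the study object loaded from there.
study = study.write_to('~/studies/curated_study', overwrite=True)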
Params classes¶
Params Container¶

class tmtk.params.Params.Params(study_folder=None)
Bases: tmtk.utils.validate.ValidateMixin

Container class for all params files, called by Study to locate all params files.

- add_params(path, parameters=None): Add a new parameter file to the Params object.
  Parameters:
  - path – a path to a parameter file.
  - parameters – a dict with parameters if you want to create a new parameter file.
- static create_params(path, parameters=None, subdir=None): Create a new parameter file object.
  Parameters:
  - path – a path to a parameter file.
  - parameters – a dict with parameters if you want to create a new parameter file.
  - subdir – subdir is used as string representation.
  Returns: parameter file object.

Base class: ParamsBase¶
AnnotationParams¶
ClinicalParams¶
HighDimParams¶

class tmtk.params.HighDimParams.HighDimParams(path=None, parameters=None, subdir=None, parent=None)
Bases: tmtk.params.base.ParamsBase

- docslink = 'https://github.com/thehyve/transmart-batch/blob/master/docs/hd-params.md'
- is_viable(): Returns: True if both the datafile and map file are located, else returns False.
- mandatory
- optional

StudyParams¶
Clinical classes¶
Clinical Container¶

class tmtk.clinical.Clinical(clinical_params=None)
Bases: tmtk.utils.validate.ValidateMixin

Container class for all clinical data related objects, i.e. the column mapping, word mapping, and clinical data files. This object has methods that add data files, and for lookups of clinical files and variables.

- ColumnMapping
- add_datafile(filename, dataframe=None): Add a clinical data file to the study.
  Parameters:
  - filename – path to file, or filename of a file in the clinical directory.
  - dataframe – if given, add pd.DataFrame to study.
- all_variables: Dictionary of {tmtk.VarID: tmtk.Variable} for all variables in the column mapping file.
- apply_blueprint(blueprint, omit_missing=False): Update the column mapping by applying a template.
  Parameters:
  - blueprint – expected input is a dictionary where keys are column names as found in the clinical datafiles. Each column header name has a dictionary describing the path, data label and other information. For example:

    {
      "GENDER": {
        "path": "Characteristics\Demographics",
        "label": "Gender",
        "concept_code": "SNOMEDCT/263495000",
        "metadata_tags": {"Info": "As measured when born."},
        "force_categorical": "Y",
        "word_map": {"goo": "values", "pile": "list"},
        "expected_categorical": ["pile", "of", "goo"]
      },
      "BPBASE": {
        "path": "Lab results\Blood",
        "label": "Blood pressure (baseline)",
        "expected_numerical": {"min": 1, "max": 9}
      }
    }

  - omit_missing – if True, variables that are not present in the blueprint will be set to OMIT.
- clinical_files
- filtered_variables: Dictionary of {tmtk.VarID: tmtk.Variable} for all variables in the column mapping file that do not have a data label in the RESERVED_KEYWORDS list.
- find_variables_by_label(label: str, in_file: str = None) → list: Search for variables based on data label. All labels are converted to lower case.
  Parameters:
  - label – the data label to search for.
  - in_file – optionally, the data file to search in.
  Returns: list of matching variables.
- get_datafile(name: str): Find datafile object by filename.
  Parameters: name – name of file.
  Returns: tmtk.DataFile object.
- get_patients(): Creates a dictionary with subject identifiers as keys; each value is a map that contains nothing, or an 'age' and/or 'gender' key that maps to the respective value.
  Returns: patients dict.
- get_trial_visits(): Returns a list of all trial visits present in this study. Visits are identified by the TRIAL_VISIT_LABEL keyword in the column mapping and can be annotated with a value and unit using the TrialVisits object.
  Returns: list of dicts.
- get_variable(var_id: tuple): Return a Variable object based on variable id.
  Parameters: var_id – tuple of filename and column number.
  Returns: tmtk.Variable.
- load_to
- params
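A short sketch of how these lookups combine, assuming the study object from earlier exposes its clinical container as study.Clinical:

# Look up variables by their data label (labels are matched lower-case).
gender_vars = study.Clinical.find_variables_by_label('gender')

# Or address a variable directly by (filename, column number).
var = study.Clinical.get_variable(('Cell-line_clinical.txt', 3))
print(var.concept_path)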
ColumnMapping¶

class tmtk.clinical.ColumnMapping(params=None)
Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Class with utilities for the column mapping file for clinical data. Can be initiated by giving a clinical params file object.

- RESERVED_KEYWORDS = ('SUBJ_ID', 'START_DATE', 'END_DATE', 'MODIFIER', 'TRIAL_VISIT_LABEL', 'INSTANCE_NUM', 'DATA_LABEL', 'VISIT_NAME', 'SITE_ID', '\\', 'OMIT', 'PATIENT_VISIT')
- append_from_datafile(datafile): Append the column mapping file with rows based on datafile column names.
  Parameters: datafile – tmtk.DataFile object.
- build_index(df=None): Build an index for the column mapping dataframe. If a pd.DataFrame (optional) is given, modify and return that.
  Parameters: df – pd.DataFrame.
  Returns: pd.DataFrame.
- get_concept_path(var_id: tuple): Return the concept path for a given variable identifier tuple.
  Parameters: var_id – tuple of filename and column number.
  Returns: concept path (str) for this variable.
- ids: A list of variable identifier tuples.
- included_datafiles: List of datafiles included in the column mapping file.
- path_changes(silent=False): Determine changes made to the column mapping file.
  Parameters: silent – if True, only print output.
  Returns: if silent=False, return dictionary with changes since load.
- path_id_dict: Dictionary with all variable ids as keys and paths as values.
- select_row(var_id: tuple): Select a row based on a variable identifier tuple. Raises an exception if the variable is not in this column mapping.
  Parameters: var_id – tuple of filename and column number.
  Returns: list of items in the selected row.
- set_column_type(var_id: tuple, value: str): Set a variable to a given data type.
  Parameters:
  - var_id – tuple of filename and column number.
  - value – value to set the column type to.
- set_concept_code(var_id: tuple, value): Set the concept code for a variable.
  Parameters:
  - var_id – tuple of filename and column number.
  - value – value to set the concept code to.
- set_concept_path(var_id: tuple, path=None, label=None): Set the concept path or data label for a given variable identifier tuple.
  Parameters:
  - var_id – tuple of filename and column number.
  - path – new value for path.
  - label – new value for data label.
- set_reference_column(var_id: tuple, value): Set the reference column for a variable; this is used for modifiers to specify which columns are affected by the modifier variable.
  Parameters:
  - var_id – tuple of filename and column number.
  - value – value to set the reference column to.
- subj_id_columns: A list of tuples with datafile and column index for SUBJ_ID, e.g. ('cell-line.txt', 1).
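For example, re-labelling and typing a column could look like this (a sketch; the concept path is a hypothetical example and study.Clinical is assumed as above):

cm = study.Clinical.ColumnMapping

var_id = ('Cell-line_clinical.txt', 3)
cm.set_concept_path(var_id, path='Characteristics\Demographics', label='Gender')
cm.set_column_type(var_id, 'CATEGORICAL')
print(cm.get_concept_path(var_id))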
DataFile¶

class tmtk.clinical.DataFile(path=None)
Bases: tmtk.utils.filebase.FileBase

Class for clinical data files; does not do much more than tmtk.FileBase.

Variable¶

class tmtk.clinical.Variable(datafile, column: int = None, clinical_parent=None)
Bases: object

Base class for clinical variables.

- VIS_CATEGORICAL = 'LAC'
- VIS_DATE = 'LAD'
- VIS_NUMERIC = 'LAN'
- VIS_TEXT = 'LAT'
- category_code: The second column of the column mapping file for this variable. This combines with self.data_label to create self.concept_path. Returns: str.
- column_map_data: Column mapping row as dictionary where keys are short descriptors. Returns: dict.
- column_type: The column data type setting can be found in the modifiers file for MODIFIER vars; otherwise it is in the Data Type column of the column mapping. If it is not found, it will be either numerical or categorical based on the datafile values.
- concept_code
- concept_path: Concept path after conversions by transmart-batch. Combination of self.category_code and self.data_label. Cannot be set. Returns: str.
- data_label: Variable data label. Returns: str.
- end_date
- forced_categorical: Check if forced categorical by entering 'CATEGORICAL' in the data type column. Can be changed by setting this to True or False. Returns: bool.
- header
- is_empty: Check if the variable is fully empty. Returns: bool.
- is_in_wordmap: Check if the variable is represented in the word mapping file. Returns: bool.
- is_numeric: True if transmart-batch will load this concept as numerical. This includes information from word mapping and column mapping. Returns: bool.
- is_numeric_in_datafile: True if the datafile contains only numerical items. Returns: bool.
- mapped_values: Data items after word mapping. Returns: list.
- max
- min
- modifier_code: Requires implementation; always returns '@'.
- modifiers: Returns a list of all modifier variables that apply to this variable. The data label for these variables has to be 'MODIFIER' and the fifth column (reference column) has to be either empty or contain the column of this variable. Returns: list of modifier variables.
- reference_column
- start_date
- subj_id
- trial_visit
- unique_values: Returns: unique set of values in the datafile.
- values: Returns: all values as found in the datafile.
- var_id: Returns: variable identifier tuple (datafile.name, column).
- visual_attributes
- word_map_dict: A dictionary with word-mapped categoricals. Keys are items in the datafile; values are what they will be mapped to through the word mapping file. Unmapped items are also added as key, value pairs. Returns: dict.
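A variable can be inspected before deciding on word mapping or typing, e.g. (a sketch, continuing from the objects above):

var = study.Clinical.get_variable(('Cell-line_clinical.txt', 3))

print(var.data_label)              # e.g. 'Gender'
print(var.is_numeric_in_datafile)  # False for values like M/F
print(var.unique_values)           # distinct values found in the datafile
print(var.word_map_dict)           # unmapped items map to themselves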
WordMapping¶

class tmtk.clinical.WordMapping(params=None)
Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Class representing the word mapping file.

- build_index(df=None): Build and sort a multi-index for the dataframe based on the filename and column number columns. If the df parameter is not set, build the index for self.df.
  Parameters: df – pd.DataFrame.
  Returns: pd.DataFrame.
- get_word_map(var_id): Return a dict with the values in the data file and the mapped values as key-value pairs.
  Parameters: var_id – tuple of filename and column number.
  Returns: dict.
- included_datafiles: List of datafiles included in the word mapping file.
- set_word_map(var_id, d): Set the word mapping for a specific variable based on its filename and column number.
  Parameters:
  - var_id – variable identifier tuple.
  - d – dictionary that contains the value map.
- word_map_changes(silent=False): Determine changes made to the word mapping file.
  Parameters: silent – if True, only print output.
  Returns: if silent=False, return dictionary with changes since load.
- word_map_dicts: Dictionary with all variable ids as keys and word map dicts as values.
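In code, the mapping from the word mapping file example could be set and read back like this (a sketch, assuming study.Clinical as above):

wm = study.Clinical.WordMapping

var_id = ('data.txt', 3)
wm.set_word_map(var_id, {'M': 'Male', 'F': 'Female'})
print(wm.get_word_map(var_id))  # {'M': 'Male', 'F': 'Female'}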
Annotations¶
Annotations Container¶
Base class: AnnotationBase¶

class tmtk.annotation.AnnotationBase.AnnotationBase(params=None, path=None)
Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Base class for annotation files.

- load_to
- marker_type

ChromosomalRegions¶

class tmtk.annotation.ChromosomalRegions.ChromosomalRegions(params=None, path=None)
Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for CNV (aCGH, qDNAseq) annotation.

- biomarkers

MicroarrayAnnotation¶

class tmtk.annotation.MicroarrayAnnotation.MicroarrayAnnotation(params=None, path=None)
Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for microarray (mRNA) expression annotation files.

- biomarkers

MirnaAnnotation¶

class tmtk.annotation.MirnaAnnotation.MirnaAnnotation(params=None, path=None)
Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for micro RNA (miRNA) expression annotation files.

- biomarkers

ProteomicsAnnotation¶

class tmtk.annotation.ProteomicsAnnotation.ProteomicsAnnotation(params=None, path=None)
Bases: tmtk.annotation.AnnotationBase.AnnotationBase

Subclass for proteomics annotation.

- biomarkers
High Dimensional data¶
HighDim¶

class tmtk.highdim.HighDim.HighDim(params_list=None, parent=None)
Bases: tmtk.utils.validate.ValidateMixin

Container class for all High Dimensional data types.
Parameters: params_list – contains a list with Params objects.

- high_dim_files
- sample_mapping_files

HighDimBase¶

class tmtk.highdim.HighDimBase.HighDimBase(params=None, path=None, parent=None)
Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Base class for high dimensional data structures.

- load_to

CopyNumberVariation¶

class tmtk.highdim.CopyNumberVariation.CopyNumberVariation(params=None, path=None, parent=None)
Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for copy number variation datatypes (aCGH, qDNAseq).

- allowed_header
- samples

Expression¶

class tmtk.highdim.Expression.Expression(params=None, path=None, parent=None)
Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for microarray mRNA expression data.

- samples

Mirna¶

class tmtk.highdim.Mirna.Mirna(params=None, path=None, parent=None)
Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for micro RNA (miRNA) data.

- samples

Proteomics¶

class tmtk.highdim.Proteomics.Proteomics(params=None, path=None, parent=None)
Bases: tmtk.highdim.HighDimBase.HighDimBase

Base class for proteomics data.

- samples

ReadCounts¶
SampleMapping¶

class tmtk.highdim.SampleMapping.SampleMapping(path=None)
Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

Base class for subject sample mapping.

- get_concept_paths: Get all concept paths from file; replaces ATTR1 and ATTR2.
  Returns: dictionary with md5 hash values as keys and paths as values.
- platform: Returns: the platform id in this sample mapping file.
- samples
- slice_path(path): Give a slice of the dataframe where the paths are equal to the given path.
  Parameters: path – path (will be converted using global logic).
  Returns: slice of dataframe.
- study_id: Returns: study_id in the sample mapping file.
Metadata Tags¶
Tags¶

Bases: tmtk.utils.filebase.FileBase, tmtk.utils.validate.ValidateMixin

- Add metadata tags from a blueprint object.
  Parameters: blueprint – blueprint object.
- A generator that gets tags from the tags file.
  Returns: tuples (<path>, <title>, <description>).
- Return tag paths delimited by the path_converter.
Utilities¶
FileBase¶
Generic module¶

- tmtk.utils.Generic.clean_for_namespace(path) → str: Converts a path and returns a namespace-safe variant. Converts characters that give errors to underscores.
  Parameters: path – usually a descriptive subdirectory.
  Returns: string.
- tmtk.utils.Generic.df2file(df=None, path=None, overwrite=False): Write a dataframe to file safely. Does not overwrite existing files automatically. This function converts concept path delimiters.
  Parameters:
  - df – pd.DataFrame.
  - path – path to write to.
  - overwrite – False (default) or True.
- tmtk.utils.Generic.file2df(path=None): Load a file specified by path into a pandas dataframe.
  Parameters: path – file to load.
  Returns: pd.DataFrame.
- tmtk.utils.Generic.find_fully_unique_columns(df): Check if a dataframe contains a fully unique column (SUBJ_ID candidate).
  Parameters: df – pd.DataFrame.
  Returns: list of names of unique columns.
- tmtk.utils.Generic.fix_everything(): Scans over all the data and indicates which errors have been fixed. This function is great for stress relief.
  Returns: all your problems, fixed by Rick.
- tmtk.utils.Generic.is_not_a_value(value): Returns whether the value is None, pd.np.nan, or an empty string.
- tmtk.utils.Generic.md5(s: str): UTF-8 encoded md5 hash string of input s.
  Parameters: s – string.
  Returns: md5 hash string.
- tmtk.utils.Generic.merge_two_dicts(x, y): Given two dicts, merge them into a new dict as a shallow copy.
- tmtk.utils.Generic.path_converter(path, to_internal=False, from_internal=False): Convert paths by creating delimiters of backslash '\' and the '+' sign, additionally converting underscores '_' to a single space.
  Parameters:
  - path – concept path.
  - to_internal – if the path is for internal use, delimit with Mappings.PATH_DELIM.
  - from_internal – replace + and _ with escaped versions.
  Returns: delimited path.
- tmtk.utils.Generic.path_join(*args): Join items with the used path delimiter.
  Parameters: args – path items.
  Returns: path as string.
- tmtk.utils.Generic.summarise(list_or_dict=None, max_items: int = 7) → str: Takes an iterable and returns a summarized string statement. Picks a random sample if the number of items > max_items.
  Parameters:
  - list_or_dict – list or dict to summarise.
  - max_items – maximum number of items to keep.
  Returns: the items joined as a string with an end statement.
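A quick look at a few of these helpers (a sketch; the import path follows the signatures above):

from tmtk.utils import Generic

print(Generic.md5('GSE8581'))                       # md5 hex digest of the string
print(Generic.is_not_a_value(''))                   # True: empty strings count as missing
print(Generic.merge_two_dicts({'a': 1}, {'b': 2}))  # {'a': 1, 'b': 2}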
utils.CPrint module¶
utils.Exceptions module¶

- exception tmtk.utils.Exceptions.ClassError(found=None, expected=None)
  Bases: Exception
  Error raised when an unexpected class is found.
  Parameters:
  - found – the class of the object that was found.
  - expected – the required class.
- exception tmtk.utils.Exceptions.DatatypeError(found=None, expected=None)
  Bases: Exception
  Error raised when an incorrect datatype is found.
  Parameters:
  - found – the datatype of the object that was found.
  - expected – the required datatype.

utils.HighDimUtils module¶
utils.mappings module¶

class tmtk.utils.mappings.Mappings
Bases: object

Collection of statics used in various parts of the code.

- EXT_PATH_DELIM = '\\'
- PATH_DELIM = '∕'
- ancestors = 'Ancestors'
- annotation_data_types = {'expression': 'Messenger RNA data (microarray)', 'cnv': 'ACGH data', 'proteomics': 'Proteomics data (mass spec)', 'vcf': 'Genomic variant data', 'mirna': 'micro RNA data (PCR)', 'rnaseq': 'Messenger RNA data (sequencing)'}
- annotation_marker_types = {'proteomics_annotation': 'PROTEOMICS', 'rnaseq_annotation': 'RNASEQ_RCNT', 'cnv_annotation': 'Chromosomal', 'vcf_annotation': 'VCF', 'mirna_annotation': 'MIRNA_QPCR', 'mrna_annotation': 'Gene expression'}
- blob = 'blob'
- cat_cd = 'Category Code'
- cat_cd_s = 'ccd'
- col_num = 'Column Number'
- col_num_s = 'col'
- column_mapping_header = ['Filename', 'Category Code', 'Column Number', 'Data Label', 'Magic 5th', 'Ontology Code', 'Data Type']
- column_mapping_s = ['fn', 'ccd', 'col', 'dl', 'm5', 'm6', 'cty']
- column_type = 'Data Type'
- concept_type = 'Data Type'
- concept_type_s = 'cty'
- data_label = 'Data Label'
- data_label_s = 'dl'
- df_value = 'Datafile Value'
- df_value_s = 'dfv'
- filename = 'Filename'
- filename_s = 'fn'
- static get_annotations(dtype=None): Return mapping for annotation classes. Return only for the datatype if dtype is set, else return the full map.
  Parameters: dtype – optional datatype (e.g. cnv_annotation).
  Returns: dict with mapping, or class.
- static get_highdim(dtype=None): Return mapping for high dimensional classes. Return only for the datatype if dtype is set, else return the full map.
  Parameters: dtype – optional datatype (e.g. cnv).
  Returns: dict with mapping, or class.
- static get_params(dtype=None): Return mapping for params classes. Return only for the datatype if dtype is set, else return the full map.
  Parameters: dtype – optional datatype (e.g. cnv).
  Returns: dict with mapping, or class.
- magic_5 = 'Magic 5th'
- magic_5_s = 'm5'
- magic_6_s = 'm6'
- map_value = 'Mapping Value'
- map_value_s = 'map'
- modifier_cd = 'modifier_cd'
- modifier_path = 'modifier_path'
- modifiers_header = ['modifier_path', 'modifier_cd', 'name_char', 'Data Type']
- name_char = 'name_char'
- ontology_code = 'Ontology Code'
- ontology_header = ['Ontology Code', 'Label', 'Ancestors', 'blob']
- term_label = 'Label'
- trial_visits_header = ['name', 'relative_time', 'time_unit']
- tv_label = 'name'
- tv_unit = 'time_unit'
- tv_value = 'relative_time'
- word_mapping_header = ['Filename', 'Column Number', 'Datafile Value', 'Mapping Value']
Toolbox package¶
Generate chromosomal regions file¶

tmtk.toolbox.generate_chromosomal_regions_file.generate_chromosomal_regions_file(platform_id=None, reference_build='hg19', **kwargs)
Creates a new chromosomal regions annotation file.
Parameters:
- platform_id – give the new platform a name to fill the first column.
- reference_build – choose either hg18, hg19 or hg38.
Returns: a pandas dataframe with the new platform.
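For example (a sketch; the platform id is a made-up name, and the function is assumed importable via its full module path as shown above):

from tmtk.toolbox.generate_chromosomal_regions_file import generate_chromosomal_regions_file

regions = generate_chromosomal_regions_file(
    platform_id='MY_QDNASEQ_PLATFORM', reference_build='hg38')
regions.head()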
Remap chromosomal regions data¶

tmtk.toolbox.remap_chromosomal_regions.map_index_to_region_ids(gene, origin_platform, region_origin)

Study Wizard¶
Create study from templates¶

tmtk.toolbox.create_study_from_templates(ID, source_dir, output_dir=None, sec_req='Y')
Create tranSMART files in the designated output_dir for all data provided in the templates in the source_dir.
Parameters:
- ID – study ID.
- source_dir – directory containing all the templates.
- output_dir – directory where the output should be written.
- sec_req – security required? 'Y' or 'N', default 'Y'.
Returns: None
The Arborist¶
tmtk.arborist.common module¶

tmtk.arborist.common.call_boris(study=None, **kwargs)
Loads the Arborist if it has been properly installed in your environment.
Parameters: study – a <tmtk.Study> object.

tmtk.arborist.connect_to_baas module¶

tmtk.arborist.connect_to_baas.get_json_from_baas(url, username=None)
Get a json file from a Boris as a Service instance.
Parameters:
- url – url with study name and version (e.g. http://transmart-arborist.thehyve.nl/trees/study-name/1/~edit/).
- username – if no username is given, you will be prompted for one.
Returns: the JSON string from BaaS.

tmtk.arborist.connect_to_baas.publish_to_baas(url, json, study_name, username=None)
Publishes a tree on a Boris as a Service instance.
Parameters:
- url – url to a BaaS instance.
- json – the stringified json you want to publish.
- study_name – a nice name.
- username – if no username is given, you will be prompted for one.
Returns: the url that points to the study you've just uploaded.

tmtk.arborist.jstreecontrol module¶

class tmtk.arborist.jstreecontrol.ConceptNode(path, var_id=None, node_type='numeric', data_args=None)
Bases: object

class tmtk.arborist.jstreecontrol.ConceptTree(json_data=None)
Bases: object

Build a ConceptTree to be used in the graphical tree editor.

- add_node(path, var_id=None, node_type=None, data_args=None): Add a ConceptNode object to the nodes list.
  Parameters:
  - path – concept path for this node.
  - var_id – unique ID that allows to keep track of a node.
  - node_type – explicitly set node type (highdim, numerical, categorical).
  - data_args – any additional parameters are put in a 'data' dictionary.
- column_mapping_file: Returns: column mapping file based on the ConceptTree object.
- high_dim_paths: All high dimensional nodes in the concept tree as a dict.
- jstree
- word_mapping

class tmtk.arborist.jstreecontrol.JSNode(path, oid=None, **kwargs)
Bases: object

This class exists as a helper to the JSTree. Its json_data method can generate sub-tree JSON without putting the logic directly into the JSTree.

class tmtk.arborist.jstreecontrol.JSTree(concept_nodes)
Bases: object

A json-like object that converts a list of nodes into something that jQuery jstree can use.

- json_data: Convert this object to json, ready to be consumed by jstree.
- json_data_string: Returns: the json_data properly formatted as a string.

class tmtk.arborist.jstreecontrol.MyEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)
Bases: json.encoder.JSONEncoder

Overrides the standard JSON Encoder to treat numpy ints as native ints.

tmtk.arborist.jstreecontrol.create_concept_tree(column_object)
Parameters: column_object – tmtk.Study object, tmtk.Clinical object, or ColumnMapping dataframe.
Returns: json string to be interpreted by the JSTree.
Data templates¶
This document describes how you can use tmtk to read your filled-in templates and write the data to tranSMART-ready files. The templates can be downloaded here.
Create study templates¶
Using the tmtk.toolbox.create_study_from_templates() function you can process any template you have filled in and output the contents to a format that can be uploaded to tranSMART. It has the following parameters:

- ID. Mandatory. Unique identifier of the study. This argument does not define the name of the study; that will be derived from Level 1 of the clinical data template tree sheet.
- source_dir. Mandatory. Path to the folder in which the filled-in templates are stored. Template files are not searched recursively, so all should be in the same folder.
- output_dir. Path to the folder where the tranSMART files should be written. If the path doesn't exist, the required folder(s) will be created. Default: ./<STUDY_ID>_transmart_files
- sec_req. Determines whether it should be a public or private study. Use Y for private or N for public. Default: Y
It is important that your source_dir
contains just one clinical data template, which is detected
by having “clinical” somewhere in the file name (case insensitive). If the template with general
study level metadata is present it should have “general study metadata” in its name (case insensitive).
All high-dimensional templates are detected by content, so file names are not important, as long as the
names don’t conflict with the templates described above.
Note: It is possible to run the function with only high-dimensional templates, but keep in mind that in that case the concept paths will have to be manually added to the subject-sample mapping files.
# Load the toolkit
import tmtk
# Read templates and write to tranSMART files
tmtk.toolbox.create_study_from_templates(ID='MY-TEMPLATE-STUDY',
source_dir='./my_templates_folder/',
sec_req='N')
Contributors¶
- Wibo Pipping
- Stefan Payrable
- Ward Weistra