GDSCTools documentation

Current version: 1.0.1, Jan 09, 2018

https://badge.fury.io/py/gdsctools.svg https://secure.travis-ci.org/CancerRxGene/gdsctools.png https://coveralls.io/repos/CancerRxGene/gdsctools/badge.svg?branch=master&service=github Documentation Status
Citation:Cokelaer et al. GDSCTools for mining pharmacogenomic interactions in cancer Bioinformatics, 2017, https://doi.org/10.1093/bioinformatics/btx744
Note:developed and tested for Python 2.7, 3.4, 3.5
Note:The GDSCTools libary works for Python 2.7 and 3.5 but the standalone pipeline to be ran on cluster works on Python 3.5 only
Contributions:Please join https://github.com/CancerRxGene/gdsctools project
Documentation:On ReadTheDocs
GitHub:On github

GDSCTools is a free open-source Python library dedicated to the study of drug responses in the context of the GDSC (Genomics of Drug Sensitivity in Cancer) project. The main developer is Thomas Cokelaer (Institut Pasteur), and it is a joint effort of the groups of Mathew Garnett (Sanger Institute) and Julio Saez-Rodriguez (RWTH Aachen & EMBL-EBI).

It contains utilities to find significant associations between drugs and genomic features (e.g., gene mutation) based on an ANOVA analysis. Other methods, such as multi-factorial linear models based on Elastic Net are also available. Besides, the library should also be useful for manipulating dedicated data sets such as IC50 (drug response) or MoBEM (genomic features) data structures. Hence, we hope that GDSCTools serves as basis for other scientists to develop further methods.

First Steps

Get started with GDSCTools

Examples

Visit our example gallery

The ANOVA analysis

Browse the full documentation

GDSCTools is written in Python. If you are a developer and/or knows already about the Python ecosystem and the pip command, just type the following command in a Terminal to install GDSCTools:

pip install gdsctools

add the option --upgrade to get the latest release. Conversely, if you are not familiar with Python or the command above, please see the Installation section for further details. Note also that we strongly recommend to use Anaconda to install dependencies (e.g., numpy, matplotlib); GDSCTools is available on bioconda channel:

conda install gdsctools.

In the following example, we provide a short Python snippet that uses the GDSCTools library. You can either copy and paste the code in a file, and execute it or type the commands in an IPython shell. With this example we investigate the associations between the IC50 of a given drug (across 52 breast cancer cell lines) and a genomic feature (here, TP53 mutation). Drugs are refer to by a unique identifier (here 1047):

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
gdsc.set_cancer_type('breast')
df = gdsc.anova_one_drug_one_feature(1047, 'TP53_mut', show=True)

(Source code, png, hires.png, pdf)

_images/index-1.png

The df object returned in the last statement is a dataframe that contains information explained in Regression analysis section.

See also

For more examples and explanations, please visit the ANOVA analysis (introduction) section.

The previous example may be verbose with comments and warnings. You may set the verbose option to False and ignore warnings as follows:

import warnings
warnings.simplefilter("ignore","exceptions.Warning")
gdsc = ANOVA(ic50_test, verbose=False)

We will see more examples on how to use GDSCTools to perform more systematic studies. However, let us note that GDSCTools also provide a standalone application called gdsctools_anova, which can be used within a standard Terminal (same output as in the previous example):

gdsctools_anova --input-ic50 <ic50 filename> --drug 1047
    --feature TP53_mut

If you want to have a go, please download this IC50 example, which is required as an input. Note that by default, GDSCTools loads a set of 50 genomic features and 1001 cell lines but in general, you should provide your own genomic feature file (see Data Format and Readers). The default data set contains only a small set of genomic features and can be downloaded: GenomicFeature example, and adapted to your needs.

See also

See Standalone application section for more details about the standalone application and the Data Format and Readers section to learn more about the expected input data formats.

Contents

Installation

Note

For those insterested, as from Oct 11 2017, we provide a Signularity container.

You know bioconda, then you can install GDSCTools as follows:

conda install gdsctools

Otherwise, please read this section in details.

GDSCTools is written in Python and depends on a set of established scientific libraries such as Matplotlib (visualisation) and Pandas (data manipulation) to cite just a few. We post releases on the Python repository and make use of the Python ecosystem to provide a robust software. Would you have any troubles, please see the FAQS or fill an issue/ticket on github.

Installation using Conda

Install conda executable

In practice, we do use Anaconda . We recommend to install conda executable via the manual installer (download <https//continuum.io/downloads>_). You may have the choice between Python 2 and 3. We recommend to choose a Python version 3.

Add conda channels

When you want to install a new package, you have to use this type of syntax:

conda install ipython

where ipython is the package you wish to install. Note that by default, conda looks on the official Anaconda website (channel). However, there are many channels available. We will use the bioconda channel. To use it, type these commands (once for all):

conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Warning

it is important to add them in this order, as mentionned on bioconda webpage (https://bioconda.github.io/).

Create an environment

Once conda is installed, open a new shell. Although this is not required strictly speaking, we would recomment to create an environment dedicated to Sequana. This environment can later be removed without affecting your system or conda installation. A conda environment is nothing else than a directory and can be created as follows:

conda create --name gdsctools_env python=3.5

Then, since you may have several environments, you must activate the gdsctools environment itself:

source activate gdsctools_env

Install gdsctools via conda (bioconda)

Finally, just type:

conda install gdsctools

This should install most of the required dependencies. However, you may need to install more packages depending on the pipeline used.

Installation using pip

Python users and developers

Releases of GDSCTools are available on Pypi so GDSCTools can be installed in a Terminal using the pip command:

pip install gdsctools

Dependencies (e.g., Pandas, Matplotlib) should be taken care of automatically. However, compilation are needed making the installation process much longer.

Installation from source

git clone https://github.com/CancerRxGene/gdsctools
cd github_gdsctools
git pull
python setup.py install

Testing your installation

You should now be ready to use GDSCTools. A good test is to check that the following executable is available. In a shell, type:

gdsctools_anova --test

or

gdsctools_anova --help

or for developers, start an IPython shell, and type for example:

from gdsctools import *
an = ANOVA(ic50_test)

Please, go to the next section for a Quick Start session.

Open an IPython shell

Under Windows, got to All Programs–>Anaconda –>Anaconda Prompt.

A shell will be opened where you can type ipython command.

Or alternatively, under Windows, got to All Programs–>Anaconda –>IPython

Notes for windows/mac/linux

The Anaconda method was tested successfully on the following systems: MAC, Windows 7 Pack1, Fedora 19 (Nov 2015) with version 0.16.5 of GDSCTools.

Under Windows, an error was raised due to scipy. This was fixed by typing:

conda remove scipy scikit-learn -y
conda install scipy scikit-learn -y

https://github.com/scikit-learn/scikit-learn/issues/4830

Testing or Production with a Singularity container

We provide a Singularity image on https://singularity-hub.org/collections/480/ . This container contains GDSCTools software and all its dependencies. The plotting may not work in an interactive way for Mac or Windows users. The main reason being that under Mac and windows a virtualbox is used by Singularity preventing a X connection. This should be solved in the near future.

First, install singularity (http://singularity.lbl.gov/). Second, download a Sequana image. For instance, for the latest master version:

singularity pull shub://CancerRxGene/gdsctools:release_1_0_0

Do not interrupt the download (2.5Go). Once downloaded, you can use, for instance, the gdsctools_anova executable:

singularity exec gdsctools-release_1_0_0.img gdsctools_anova --help

Would you miss a dependency, just enter into the singularity container and install the missing dependencies. You will need writable permission:

sudo singularity shell -w gdsctools-release_1_0_0.img

Then, inside the container, install or fix the problem and type exit to save the container.

Quick Start

Data

In the following examples and most of the examples in GDSCTools, we use data sets known as version 5 (v5) and version 17 (v17). Each set is made of a drug responses file (IC50s) and a genomic features file. The 4 files are copies of original files to be found on the CancerRxGene page. Altough not required for the following examples, you may want to download them directly as follows:

wget https://tinyurl.com/y7nn6e5h -O genomic_features_v17.csv.gz
wget https://tinyurl.com/ybtlsrpz -O genomic_features_v5.csv.gz
wget https://tinyurl.com/ycavjd37 -O IC50_v17.csv.gz
wget https://tinyurl.com/yakfnqmb -O IC50_v5.csv.gz

ANOVA analysis

If you already know what you can do with GDSCTools, we assume you have a well formatted IC50 matrix and a genomic features binary matrix. Then, you can run the entire ANOVA analysis as follows:

from gdsctools import ANOVA
# For example, use these test files
# from gdsctools import ic50_test as ic50_filename
# from gdsctools import gf_v17 as genomic_feature_filename
gdsc = ANOVA(IC50_filename, genomic_feature_filename)
results = gdsc.anova_all()

And create an HTML report as follows:

from gdsctools import ANOVAReport
report = ANOVAReport(gdsc, results)
report.create_html_pages()

The results variable contains all tested associations within a single dataframe. The report will focus on significant associations and create boxplots or volcano plots accordingly.

More details about the ANOVA analysis itself can be found in the ANOVA analysis (introduction) and The ANOVA analysis in details sections. The data structure can be found in the next section (Data Format and Readers).

Regression analysis

Similarly, for the regression analysis, one can write a script as above. Here, we restrict the analysis to the first 4 drugs (figures are open for each drug):

from gdsctools import GDSCLasso
lasso = GDSCLasso(IC50_filename, genomic_feature_filename)
for drugid in lasso.drugIds[0:4]:
    res = lasso.runCV(drugid, kfolds=8)
    best_model = lasso.get_model(alpha=res.alpha)
    #weights = lasso.plot_weight(drugid, best_model)
    boxplots = lasso.boxplot(drugid, model=best_model, n=10, bx_vert=False)

However, we would recommend to use a worflow designed for this analysis. If you type this command in a shell:

gdsctools_regression -I IC50_filename -F genomic_feature_filename
    --method lasso -o analysis
cd analysis

it creates a Snakemake pipeline and its configuration file. You can then edit file named config.yaml and once done, execute the pipeline:

snakemake -s regression.rules

Warning

you must install Snakemake, in which case you must use Python>=3.5 (conda install snakemake)

See Regression analysis section for details.

Data Format and Readers

CSV and TSV formats

The main formats used in GDSCTools are CSV-based but TSV-based formatted files may be accepted although not encouraged. The data files may be compressed using bz2, xz, gzip of zip methods. See Pandas documentation to know the exact list of authorised compression method. Note that files saved using GDSCTools will be saved in CSV format only.

GDSCTools provides tools to read different kind of structured CSV files. For instance in the ANOVA analysis, these 3 types of CSV-input files defined in readers module) are used:

There are all based on the same Reader class. Not a number values are encoded with the string NA or NaN. Besides, empty strings or fields made of spaces or tabs are also replaced by spaces.

Quote characters are also removed.

Warning

Some readers will use the name of the extensions to infer the separator so it is important that the extension reflects the content of the file. A compressed file named as e.g. ic50.csv.gz will therefore be interpreted as a CSV file.

Note

With CSV files, if a column’s name is ambiguous in the header (i.e, contains already a comma), then it should be enclosed within double quotes to avoid ambiguities (e.g A,B should be “A,B”). See the Drug Decode section for an example.

IC50

The specific format for the IC50 data is CSV file where the header must contain a column named COSMIC_ID. Other column should correspond to a given drug identifier (an integer). The order of the columns does not matter. So, each row contains the IC50s for a given COSMIC identifier.

Warning

the data read in the CSV file is not transformed. So, IC50 data should be logged data.

Note

The IC50 matrix can be populated with other data (e.g., AUCs).

Although the name in the header does not matter note that if one column’s name starts with Drug then all columns that do not start with Drug are ignored (except the special column COSMIC_ID). This feature was implemented to account for old data files that stores all Drug identifiers and also all genomic features within a single file.

Here is an example of IC50 input:

COSMIC_ID, 1,           20,           40
111111,    0.5,         0.8,          10
222222,    1,           2,            10

The following old-style IC50 would be equivalent since (i) the last column is dropped (does not start with Drug) and (ii) the other column’s names are transformed into integer keeping only the middle part:

COSMIC_ID, Drug_1_IC50, Drug_20_IC50, Drug_40_IC50, other
111111,    0.5,         0.8,          10,           10
222222,    1,           2,            10,           20

If you save that example in a file (or download it _static/ic50_tiny.csv), you can read it with the IC50 class as follows:

>>> from gdsctools import IC50
>>> r = IC50('_static/ic50_tiny.csv')
>>> r.drugIds
[1, 20, 40]

Note

the columns’ names should be identifiers (not drug names). There are two main reasons. The first one is that it allows us to keep anonymous all drug names and targets. The second reason is that many characteristics such as plate number and drug concentration may be associated with a drug identifier. This should be stored in a different table rather than in the name. It can then be handled and interpreted using the DrugDecode file (see below).

Note

column without a name are ignored.

See also

developers should look at the references for more functionalities of the IC50 class (e.g., filter by tissues, removing drugs, visualisation of IC50s).

Genomic Features

The ANOVA analysis computes the associations between the IC50 and genomic features. This is the second input data set required for instance in the ANOVA analysis. Be aware that in the ANOVA analysis, the intersection between the IC50 and GenomicFeatures is made on the COSMIC_ID: cell lines not found in both CSV files will be dropped.

In addition to the COSMIC identifiers, the genomic feature file should contain the following columns:

- TISSUE_FACTOR
- MSI_FACTOR
- MEDIA_FACTOR

If not provided, the tissue, MSI and MEDIA factors will not be taken into account in the regression analysis. If the TCGA tissue is not provided, it is created and set to unidentified.

Note

Changed in version 0.9.11: A column called ‘Sample Name’ was interpreted if found. This is not the case anymore. It is actually removed now.

All remaining columns are assumed to be genomic features.

Warning

In the current version, all columns starting with Drug_ are removed without warning.

Here is a simple example:

COSMIC_ID, TISSUE_FACTOR, MSI_FACTOR, BRAF_mut, gain_cna
111111, lung_NSCLC,  1, 1, 0
222222, prostate,    1, 0, 1

It can be saved and read as follows with the GenomicFeatures

>>> from gdsctools import GenomicFeatures
>>> gf = GenomicFeatures('_static/gf_tiny.csv')
>>> gf
GenomicFeatures <Nc=2, Nf=2, Nt=2>

In GDSCTools, we provide a zipped Genomic Features file. It contains about 1000 cell lines and 47 genomic features (gene mutations). A more complex file tagged v17 is also provided with about 600 features v17 genomic feature.

Note that you may create instance of GenomicFeatures without input but a default data set is loaded (the subset aforementionned):

>>> from gdsctools import GenomicFeatures
>>> gf = GenomicFeatures()
>>> print(gf)
Genomic features distribution
Number of unique tissues 27
Here are the first 10 tissues: myeloma, nervous_system, soft_tissue, bone, lung_NSCLC, skin, Bladder, cervix, lung_SCLC, lung
MSI column: yes
MEDIA column: no

There are 47 unique features distributed as
- Mutation: 47
- CNA (gain): 0
- CNA (loss): 0

Combine IC50 and Genomic Features

Here is an example on how to plot histograms of IC50s grouped by tissues. For convenience, we keep only 9 tissues.

from gdsctools import *
from numpy import mean
ic50 = IC50(ic50_v17)
gf = GenomicFeatures(gf_v17)
# select tissue column in same order as those stored in IC50 dataframe
tissues = gf.df.loc[ic50.df.index]['TISSUE_FACTOR']
ic50.df['tissue'] = tissues

# keep only 9 tissues
tokeep = list(set(tissues))[0:9]
ic50.df = ic50.df.query("tissue in @tokeep")

# Group by tissues
tt = ic50.df.groupby("tissue").aggregate(mean).transpose()
#plot histogram of IC50 group by tissues
tt.hist(bins=30, sharex=True)

(Source code, png, hires.png, pdf)

_images/data-1.png

Drug Decode

DrugDecode files are not required to perform the analysis. You may skip that section.

Drugs used in GDSCTools analysis may be public or not. In order to guarantee that drugs are kept anonymised (if not public), we enforce the CSV files that contains the IC50s to used drug identifiers instead of drug names.

When creating reports, the Data Packages producer or owner or the drugs may want to decode the drug identifier. The information to perform that task is provided within the DrugDecode CSV file.

The DrugDecode class reads a CSV file that contains information about a drug and its target(s). It must contain 3 columns named as follows:

DRUG_ID,    DRUG_NAME,  DRUG_TARGET
999,        Erlotinib,  EGFR
1039,       SL 0101-1,  "RSK, AURKB, PIM3"

Note the usage of quotes in the last row/last columns to avoid conflicts with the CSV format itself.

These columns will be used if provided:

- WEBRELEASE
- OWNED_BY

In addition, these columns may be populated for later use:

- CHEMSPIDER_ID
- PUBCHEM_ID
- CHEMBL_ID

An example can be read as follows:

>>> from gdsctools import DrugDecode, datasets
>>> drug_filename = datasets.testing.drug_test_csv.location
>>> dd = DrugDecode(drug_filename)
>>> dd.get_name(1047)
'Nutlin-3a'
>>> dd.df.loc[999]
CHEMBL_ID              NaN
CHEMSPIDER_ID          NaN
DRUG_NAME        Erlotinib
DRUG_TARGET           EGFR
OWNED_BY               NaN
PUBCHEM_ID             NaN
WEBRELEASE             NaN
Name: 999, dtype: object

DrugDecode files are not required for the analysis but are used by gdsctools.anova_report.ANOVAReport to fill the HTML reports.

You can also run the analysis and set the drug names and target later on as follows using the drug_annotations method:

from gdsctools import *
an = ANOVA(ic50_test)
an.anova_all()
results = an.anova_all()
dd = DrugDecode("v19_drug_decode.csv")
newdf = dd.drug_annotations(results.df)

ANOVA analysis (introduction)

Although we provide a Standalone application called gdsctools_anova, the most up-to-date and flexible way of using GDSCTools is to use the library from an IPython shell. This method (using IPython shell) is also the only way to produce Data Packages.

In this section we will exclusively use Python commands, which should also be of interest if you want to look at the Notebooks section later.

We assume now that (i) you have GDSCtools installed together with IPython. If not, please go back to the Installation section. (ii) you are familiar with the INPUT data sets that will be used hereafter.

Note

Beginners usually enter in an Python shell (typing python). Instead, we strongly recommend to use IPython, which is a more flexible and interactive shell. To start IPython, just type this command in a terminal:

ipython

You should now see something like:

Python 2.7.5 (default, Nov  3 2014, 14:33:39)
Type "copyright", "credits" or "license" for more information.

IPython 4.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:

Note

All snippets in this documentation are typed within IPython shell. You may see >>> signs. They indicate a python statement typed in a shell. Lines without those signs indicate the output of the previous statement. For instance:

>>> a = 3
>>> 2 + a
5

means the code 2 + a should print the value 5

The IC50 input data

Before starting, we first need to get an IC50 data set example. Let us use this IC50 example test file.

See also

More details about the data format can be found in the Data Format and Readers section as well as links to retrieve IC50 data sets.

In Python, one can easily import all functionalities available in GDSCTools as follows:

from gdsctools import *

Although this syntax is correct, in the following we will try to be more explicit. So, we would rather use:

from gdsctools import IC50

This is better coding practice and has also the advantage of telling beginners which functions are going to be used.

Here above, we imported the IC50 class that is required to read the example file aforementioned. Note that this IC50 example is installed with GDSCTools and its location can be obtained using:

from gdsctools import ic50_test
print(ic50_test.filename)

The IC50 class is flexible enough that you can provide the filename location or just the name ic50_test as in the example below, and of course the filename of a local file would work as well:

>>> from gdsctools import IC50, ic50_test
>>> ic = IC50(ic50_test)
>>> print(ic)
Number of drugs: 11
Number of cell lines: 988
Percentage of NA 0.206569746043

As you can see you can get some information about the IC50 content (e.g., number of drugs, percentage of NaNs) using the The print statement function. See gdsctools.readers.IC50 and Data Format and Readers for more details.

Getting help

At any time, you can get help about a GDSCTools functionality or a python function by adding question tag after a function’s name:

IC50?

With GDSCTools, we also provide a convenience function called gsdctools_help():

gdsctools_help(IC50)

that should open a new tab in a browser redirecting you to the HTML help version (on ReadTheDoc website) of a function or class (here the IC50 class).

The ANOVA class

One of the main application of GDSCTools is based on an ANOVA analysis that is used to identify significant associations between drug and genomic features. As mentionned above, a first file that contains the IC50s is required. That file contains experimentall measured IC50s for a set of N_d drugs across N_c cell lines. The second data file is a binary file that contains various features across the same cell lines. Those N_f features are usually of genomic types (e.g., mutation, CNA, Methylation). A default set of about 50 genomic features is provided and automatically fetched in the following examples. You may also provide your own data set as an input.

The default genomic feature file is downloadable and its location can be found using:

from gdsctools import datasets
gf =  datasets.genomic_features

See also

More details about the genomic features data format can be found in the Data Format and Readers section.

The class of interest ANOVA takes as input a compulsary IC50 filename (or data) and possibly a genomic features filename (or data). Using the previous IC50 test example, we can create an ANOVA instance as follows:

from gdsctools import ANOVA, ic50_test
an = ANOVA(ic50_test)

Again, note that the genomic features is not provided, so the default file aforementionned will be used except if you provide a specific genomic features file as the second argument:

an = ANOVA(ic50_test, "your_genomic_features.csv")

There are now several possible analysis but the core of the analysis consists in taking One Drug and One Feature (ODOF hereafter) and to compute the association using a regression analysis (see Regression analysis for details). The IC50 across the cell lines being the dependent variable Y and the explanatory variables denoted X are made of tissues, MSI and one genomic feature. Following the regression analysis, we compute the ANOVA summary leading to a p-value for the significance of the association between the drug’s IC50s and the genomic feature considered. This calculation is performed with the anova_one_drug_one_feature() method.

We will see a concrete example in a minute. Once an ODOF is computed, one can actually repeat the ODOF analysis for a given drug across all features using the anova_one_drug() method. This is also named One Drug All Feature case (ODAF). Finally we can even extend the analysis to All Drugs All Features (ADAF) using anova_all().

The following image illustrates how those 3 methods interweave together like Russian dolls.

_images/layout1.png

The computational time is therefore increasing with the number of drugs and features. Let us now perform the analysis for the 3 different cases.

One Drug One Feature (ODOF)

Let us start with the first case (ODOF). User needs to provide a drug and a feature name and to call the anova_one_drug_one_feature() method. Here is an example:

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
gdsc.anova_one_drug_one_feature(1047, 'TP53_mut', show=True, fontsize=16)

(Source code)

Setting the show parameter to True, we created a set of 3 boxplots that is one for each explanatory feature considered: tissue, MSI and genomic feature.

If there is only one tissue, this factor is included in the explanatory variable is not used (and the corresponding boxplot not produced). Similarly, the MSI factor may be ignored if irrelevant.

In the first boxplot, the feature factor is considered; we see the IC50s being divided in two populations (negative and positive features) where all tissues are mixed.

In the second boxplot, the tissue variable is explored; this is a decomposition of the first boxplot across tissues.

Finally, the third boxplot shows the impact of the MSI factor. Here again, all tissues are mixed. In the MSI column, zeros and ones correspond to MSI unstable and stab le, respetively. The pos and neg labels correspond to the feature being true or not, respetively.

The output of an ODOF analysis is a time series that contains statistical information about the association found between the drug and the feature. See for gdsctools.anova_results.ANOVAResults for more details.

If you want to repeat this analysis for all features for the drug 1047, you will need to know the feature names. This is stored in the following attribute:

gdsc.feature_names

The best is to do it in one go though since it will also fill the FDR correction column based on all associationa computed.

One Drug All Features (ODAF)

Now that we have analysed one drug for one feature, we could repeat the analysis for all features. However, we provide a method that does exactly that for us (anova_one_drug()):

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
results = gdsc.anova_one_drug(999)
results.volcano()

(Source code, png, hires.png, pdf)

_images/anova_partone-2.png

In a python shell, you can click on a dot to get more information.

Here, we have a different plot called a volcano plot provided in the gdsctools.volcano module. Before explaining it, let us understand the x and y-axis labels.

Each row in the dataframe produced by anova_one_drug() is made of a set of statistical metrics (look at the header results.df.columns). It includes a p-value (coming from the ANOVA analysis) and a signed effect size can also be computed as follows.

In the ANOVA analysis, the population of IC50s is split into positive and negative sets (based on the genomic feature). The two sets are denoted IC50_{pos} and IC50_{neg}. Then, the signed effect size \eta is computed as follows:

\eta = \rm{sgn} ( \Delta) \times Es(IC50_{pos}, IC50_{neg})

where

\Delta = \overline{IC50_{pos}} - \overline{IC50_{neg}}

and Es is the effect size function based on the Cohens metric (see gdsctools.stats.cohens()).

In the volcano plot, each drug vs genomic feature has a p-value. Due to the increasing number of possible tests, we have more chance to pick a significant hit by pure chance. Therefore, p-values are corrected using a multiple testing correction method (e.g., BH method). The column is labelled FDR. Significance of associations should therefore be based on the FDR rather than p-values. In the volcano plot, horizontal dashed lines (red) shows several FDR values and the values are shown in the right y-axis. Note, however that in this example there is no horizontal lines. Indeed, the default value of 25% is well above the limits of the figure telling us that there is no significant hits.

Note that the right y-axis (FDR) is inversed, so small FDRs are in the bottow and the max value of 100% should appear in the top.

Note

P-values reported by the ODOF method need to be corrected using multiple testing correction. This is done in the the ODAF and ADAF cases. For more information, please see the gdsctools.stats.MultipleTesting() description.

All Drug All Features (ADAF)

Here we compute the associations across all drugs and all features. In essence, it is the same analysis as the ODAF case but with more tests.

In order to reduce the computational time, in the following example, we restrict the analysis to the breast tissue using set_cancer_type() method. This would therefore be a cancer-specific analysis. If all cell lines are kept, this is a PANCAN analysis. The information about tissue is stored in the genomic feature matrix in the column named TISSUE_FACTOR.

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
gdsc.set_cancer_type('breast')
results = gdsc.anova_all()

results.volcano()

(Source code, png, hires.png, pdf)

_images/anova_partone-3.png

Warning

anova_all() may take a long time to run (e.g., 10 minutes, 30 minutes) depending on the number of drugs and features. We have a buffering in place. If you stop the analysis in the middle, you can call again anova_all() method and previous ODAF analysis will be retrieved starting the analysis where you previously stoped. If this is not what you want, you need to call reset_buffer() method.

The volcano plot here is the same as in the previous section but with more data points. The output is the same as in the previous section with more associations.

Learn more

If you want to learn more, please follow one of those links:

The ANOVA analysis in details

About the settings

When using the ANOVA instance or the ANOVAReport to create an HTML report (see HTML report), all tunable settings are accessible from an attribute called settings:

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
gdsc.settings

This settings attribute is an instance of gdsctools.settings.ANOVASettings, which is fully documented in the reference. For example in the HTML report section, we will change the default output directory where HTML pages are saved to a user-defined value.

It is also important to note that when calling ANOVAReport, the first argument is an ANOVA instance that already contains the settings. So, ANOVAReport will use that settings automatically (it may still be changed later). Consider this example:

>>> from gdsctools import ANOVA, ic50_test, ANOVAReport
>>> gdsc = ANOVA(ic50_test)
>>> gdsc.settings.FDR_threshold = 15
>>> results = gdsc.anova_all()

>>> ar = ANOVAReport(gdsc, results)
>>> ar.settings.FDR_threshold
15

We will see more settings here below but first let us come back on the ANOVA analysis.

Regression analysis

By default, the regression uses an OLS method. 4 Factors may be taken into account depending on the content of the GenomicFeature data set.

Here below, we use the R syntax where C() stands for categorical data and Y is the variable to be explained. Depending on the data and the gdsctools.anova.settings one of the following formula will be used:

Y \sim C(tissue) + C(media) + C(MSI) + Feature

Y \sim C(tissue) + C(MSI) + Feature

Y \sim C(MSI) + Feature

Y \sim Feature

Here are some rules applied:

  • Feature factor is always included by definition and is in last position
  • MSI and Media are included by default if found in the genomic feature data set. Note, however than one can exclude these factors using the settings.
  • Tissue is included if there are more than 2 tissues. Again, one can change the settings.analysis_type to the name of the tissue (instead of PANCAN, the default value).

Note

The order of the different features in the equations may have an impact on the analysis.

Since analysis may be time-consuming, we have hard-coded the regression formula. Note, however, that in version 0.16, we have added the anova_one_drug_one_feature_custom() method, which can be use for any type of regression based on a user formula. This is slower than the hard-coded version mentionned above but is more flexible. One can for instance set the formula to specify the treatement to be used as a reference:

an.settings['regression_formula'] = "Y ~ C(tissue, Treatment(reference='breast'))"

The ANOVA analysis itself uses a type I error. The summary can be obtained for a specific combination of drug and feature as follows:

from gdsctools import *
an = ANOVA(ic50_test, gf_v17)
drugid = 1047
feature = an.feature_names[0]
odof = an._get_one_drug_one_feature_data(drugid, feature)
res = an.anova_one_drug_one_feature(drugid, feature)
an._get_anova_summary(an.data_lm, output="dataframe", odof=odof)

and should show the following summary:

              Df       Sum Sq    Mean Sq  F value       PR(>F)
tissue      26.0   352.345257  13.551741  9.26853  1.63864e-31
msi          1.0     5.309389   5.309389  3.63129    0.0570537
feature      1.0     3.186109   3.186109   2.1791     0.140282
Residuals  817.0  1194.554709   1.462123     None         None

An alternative (simpler but slower) way since version 0.16 is to use:

an.anova_one_drug_one_feature_custom(drugid, feature,
    formula='Y ~ C(tissue) + C(msi) + feature')

Type I error

The ANOVA analysis is based on a Type I error, also called sequential sum of squares. Consider 2 effects A and B, it tests the main effect of factor A, followed by the main effect of factor B after the main effect of A, followed by the interaction effect AB after the main effects. So, this type of sums of squares gives different results for unbalanced data depending on the sequence.

MSI factor

MSI is always included by default. However, you may exclude it by setting its value to False:

settings.include_MSI_factor

If MSI_FACTOR column is not found in the Genomic Feature data set, the MSI factor will be excluded automatically and the parameter above set to False.

Warning

If you force the MSI factor to True whereas there is not enough data in the binary sets of the MSI factor, error will be raised.

MEDIA factor

If included in the genomic feature data set, MEDIA are included by default. However, you may exclude it by setting its value to False:

settings.include_MEDIA_factor

If MEDIA_FACTOR column is not found in the Genomic Feature data set, the MEDIA factor will be set automatically to False.

Tissue factor

Another factor used in the regression (tissue) will be automatically excluded if there is only one tissue (or none). If several tissues are available, you can still exclude it from the regression analysis by settings this parameter to anything different from the default value (PANCAN):

settings.analysis_type = PANCAN

Filtering

When performing the analysis for a given drug and feature, the regression may not be performed if there is not enough statistics.

These parameters will influence the number of tests being performed (number of associations of drug vs feature in anova_all()):

- minimum_nonna_ic50
- MSI_feature_threshold
- feature_factor_threshold

The first parameter indicates the minimum number of valid IC50 required for a given drug to be analysed. The current default value is 6.

The second parameter indicates the minimum size of the positive and negative population when IC50 are filtered by MSI factor (defaults to 2).

The third parameter indicates the minimum size of the positive and negative population when IC50 are filtered by Feature factor (defaults to 3).

This table summarizes the effect of these parameters:

_images/regression_filtering.png

The left hand side table mimics the IC50 data. The first column should and last 3 rows are not to be included in an IC50 matrix (see Data Format and Readers) but are added here as annotations for the following discussions.

When the regression analysis is performed for a given drug and a given feature, 3 filters are applied. First, a minimum number of values is required (minimum_nonna_ic50 setting). Therefore, the drug is not analysed. The second check is performed with respect to the MSI values. A drug can be analysed only if (once NA have been discarded) the number of IC50s corresponding to positive and negative MSIs is greater or equal to MSI_feature_threshold. In our example, the drugs in column D_pMSI=0 and D_pMSI=1 are therefore discarded since they have zero and only one positive MSI, respectively.

Finally, similarly to the MSI check, a drug/feature association is analysed if the number of IC50s corresponding to positive and negative feature is or equal to feature_factor_threshold.

Multiple testing corrections

By default, the multiple testing correction is based on Benjamini–Hochberg (BH) method but it can be set to other methods using

settings.pval_correction_method

See also

MultipleTesting for details.

The multiple testing is performed globally across all drugs and all cell lines.This parameter is stored in

settings.pvalue_correction_level

By default it is set to global. Set it to local to keep the multiple correction at the drug level (ODAF).

When you perform an ANOVA analysis, the multiple correction method is used to populate the results column named ANOVA_FEATURE_FDR.

If you change your mind and wish to run the analysis with another method, you do not need to re-run the entire analysis. Instead, simply change the method’s name and call anova_all() again. Only the multiple testing computation is performed, skipping ANOVA testing, which have already been done.

results = an.anova_all()
an.settings.pvalue_correction_method = 'qvalue'
results = an.anova_all()

volcano plots

The volcano plots are one of the main results of the analysis and summarizes visually the significance of the different associations.

It is part of the AnovaResults class and is returned either by an ODAF or ADAF analysis:

from gdsctools import ANOVA, ic50_test
gdsc = ANOVA(ic50_test)
res = gdsc.anova_all()
res.volcano()

(Source code, png, hires.png, pdf)

_images/anova_parttwo-1.png

Here are some parameters used to tune the plots and selection of significant events:

  • pvalue_threshold is used to select significant hits. See ANOVAReport.
  • effect_threshold is used to select significant hits as well.
  • FDR_threshold is used in gdsctools.volcano.VolcanoANOVA (horizontal lines)
  • volcano_FDR_interpolation uses interpolation to plot the FDR lines in the volcano plot.
  • volcano_additional_FDR_lines : [0.01, 0.1, 10]

See also

VolcanoANOVA.

others

See ANOVASettings for the full listing.

Note

Some settings will be set automatically when calling some functions. For instance, if you call anova.ANOVA.set_cancer_type() to a single tissue, then the analysis_type will be set to the tissue’s name. If there are not enough positive or negative MSI, the MSI factor will ignored.

HTML report

The output of an ANOVA analysis can be used to create an HTML report.

First, let us generate the results again (see Quick Start).

>>> from gdsctools import ANOVA, ic50_test
>>> gdsc = ANOVA(ic50_test)
>>> gdsc.set_cancer_type('breast')
>>> results = gdsc.anova_all()

The results variable contains a dataframe df. This dataframe contains as many rows as associations of drugs and features. See ANOVAResults for the contents. The HTML report extracts the significant associations, and then create figures and HTML pages for each of the associations that are significant. You can easily create HTML report as follows:

>>> from gdsctools import ANOVAReport
>>> report = ANOVAReport(gdsc, results)
>>> report.create_html_pages()

The report creates a Data Package, which details can be found in the Data Packages section.

Images are created for each significant associations and may take a while.

Some tunable settings are available in the settings (see About the settings). For instance, you can set the output directory to a user value instead of (html_gdsc_anova):

>>> report.settings.directory = 'BLCA'

Data Packages

Warning

Only one IC50 files should be provided. It is filtered out according to the genomic feature file.

Definition

A Data Package is a terminology used to speak about a directory that contains the results of an analysis. For example, the data package called BLCA has a tree directory that looks like:

BLCA/
|-- code
|-- css
|-- images
|-- INPUT
|-- js
`-- OUTPUT

The main directory contains a file named index.html. Many other HTML files may be present but the index is your entry point to browse the content of data package.

The directory css and js contains resources required by the HTML documents.

The INPUT directory contains 3 data files and the settings used during the analysis:

  1. ANOVA_input.csv
  2. DRUG_DECODE.csv
  3. genomic_features.csv
  4. settings.json

See section on Data Format and Readers for details about the data format and the About the settings section.

Finally, the OUTPUT directory contains:

- drugs_summary.csv
- features_summary.csv
- results.csv

The images directory contains a mix of images and HTML for each significant associations.

Create your own package

In fact, we have already seen how to create a package. This is covered in HTML report when we used the ANOVAReport class but let us look at the code again:

from gdsctools import ANOVA, ic50_test, ANOVAReport
gdsc = ANOVA(ic50_test)
results = gdsc.anova_all()

report = ANOVAReport(gdsc, results)
report.create_html_pages()

Here, we have not yet mentionned the type of cancers or tissues since we used a simple genomic feature file but one we need to repeat this analysis across man y different genomic features files. The gdsctools.gdsc.GDSC class will help us for that purpose.

Create data packages across TCGA

When we do a full GDSC analysis, the cell lines span a set of TCGA tissues (e.g., COREAD, BLCA) and generally we want to perform the analysis not on all cellines at the same time but each type of tissues independently.

Besides, you may then wish to have data packages not only for a given TCGA tissue but also for a given company (if your DrugDecode file is filled properly; see later).

The recommended way is to used the gdsctools.gdsc.GDSC class that will help you in this task.

First, you need to prepare the input data. Create a directory and add these files:

- a unique IC50 file
- The genomic features files for each type of tissues.
- The DrugDecode file

The genomic feature must be named as follows:

<prefix>_BLCA.csv
<prefix>_COREAD.csv
...

The name of the TCGA can include ALL, PANCAN and will be used later to create the directories for each data paakage.

The important point being that there must be an underscore only and followed by the TCGA tag.

The GDSC class will then loop over the TCGA cases and create data packages.

from gdsctools import GDSC
gg = GDSC("IC50.csv", "DrugDecode.csv", "GF_*.csv")
gg.analyse()

This may take hours to finalise: the ANOVA and creation of all images will be done for each TCGA.

This may be parallelised since each input Genomic Feature analysis is independent:

gg_blca = GDSC("IC50.csv", "DrugDecode.csv", "GF_BLCA.csv")
gg_blca.analyse()

gg_coread = GDSC("IC50.csv", "DrugDecode.csv", "GF_COREAD.csv")
gg_coread.analyse()

In an error occurs for one Genomic Feature file, the analysis we jump to the next file. You may need to check re-run the specific TCGA tissue analysis your self when an error occured (meaning you do not need to re-run everything).

Once done, you should have all data packages locally in the directory where you ran the scripts.

The next step is to read back all those results and create data pacakges dedicated to a company. Based on the DRUG_DECODE file:

gg = GDSC("IC50.csv", "DrugDecode.csv", "GF_*.csv")
gg.create_data_packages_for_companies()

For each companies, which names can be checked with:

gg.companies

a new directory (data package) is created locally

For now, it is important to run this in the same directory where previous pacakges were created.

Again this may be parallelised:

for each company in gg.companies:
    single = GDSC("IC50.csv", "DrugDecode.csv", "GF_*.csv")
    single.create_data_packages_for_companies([company])

Create summary pages

Following the creating of the “all” TCGA packages and the dedicated packages for all companies, you end up with quite a few directories. This command will create summary HTML page to ease your life:

gg.create_summary_pages()

This must be called after analyse() and create_data_packages_for_companies().

OmniBEM Builder

OmniBEM Builder is an optional tool within GDSCTools designed to give the user more flexibility on the levels of genomic annotation probed by GDSCTools.

By default, GDSCTools is based on the genomic annotation of 1001 cell lines represented in COSMIC and published by Iorio et al (Cell Resource). The annotation includes genetic variants as identified by exome sequencing, copy number alterations and differentially methylated CPG islands.

OmniBEM Builder allows the user to merge the different levels of annotations into a single gene-level view that queries whether a given gene has been altered at any level of annotations.

The user can additionally specify which sets of genomic annotations to integrate as well as upload and integrate their own sets of genomic annotations.

from gdsctools import OmniBEM

Input Data Structure

At its most basic, OmniBEM Builder requires a cell line ID (e.g. COSMIC ID), a gene name and an alteration type (e.g. Point mutation, Amplification or Copy Number Alteration). A simple example table is given below:

COSMIC ID GENE TYPE SAMPLE TISSUE_TYPE IDENTIFIER
111 ZAP Methylation NM breast 1

Example

We provide a data set available in GDSCTools that can be loaded as follows

from gdsctools import gdsctools_data, OmniBEMBuilder

input_data = gdsctools_data("test_omnibem_genomic_alterations.csv.gz")
bem = OmniBEMBuilder(input_data)

The data is stored in the df attribute:

bem.df

From this data frame, one can filter, group and perform various data mangling operations. For instance these commands group the data by tissue type, count the number of row per tissue and return the 5 most representative tissues:

>>> count = bem.df.groupby("TISSUE_TYPE").count()
>>> list(count.sort_values("COSMIC_ID").index[-5:])
['SKCM', 'BRCA', 'HNSC', 'LUAD', 'COAD/READ']

In OmniBEMBuilder, we provide convenient methods to filter the data by genes, cosmic IDs, sample names, tissues and types.

For instance:

>>> # You may filter the data for instance to keep only a set of genes.
>>> # Here, we keep the 100 most present genes:
>>> bem = OmniBEMBuilder(input_data)
>>> len(bem)
56943
>>> gene_list = bem.get_significant_genes(100)
>>> bem.filter_by_gene_list(gene_list.index)  # gene_list is a dataframe
>>> len(bem)
6141

Once you BEM data is filtered as expected, save it:

gf = bem.get_genomic_features()
gf.to_csv("Your_genomic_features.csv")

This file can now be used with tha ANOVA or Regression analysis.

Regression analysis

Since version 0.15, we have also the ability to perform a regression analysis at the drug level (ODAF mode).

The are currently 3 types of regression implemented:

  1. Ridge
  2. Lasso
  3. Elastic Net

Information about functionalities implemented in GDSCTools are available in the reference gdsctools.regression. and in a notebook (see https://github.com/CancerRxGene/gdsctools/tree/master/notebooks) called Regression.

We first give some examples following information contained in the reference and the notebook. In the example here below, the data sets are small. In practice, you may have many features and would require a cluster to perform the computation in a reasonable amount of time. We therefore develop a pipeline that wraps up the analysis. The pipeline uses the Snakemake framework (see below).

Warning

you must install Snakemake, in which case you must use Python >=3.5

Doing the analysis using GDSCTools library

As usual, we will use some data provided in GDSCTools in the form of ain IC50 data set and a genomic features data set:

from gdsctools import *
ic50 = IC50(ic50_v17)
gf = GenomicFeatures(gf_v17)

To decrease the computational time, let us select only genomic features that are factors or mutations:

gf.df = gf.df[[x for x in gf.df.columns if "FACTOR" in x or "mut" in x]]

Here, we will use the Lasso regression:

lasso = GDSCLasso(ic50, gf)

For each drug, all features are taken and a regression is performed. For each drug, we tune the alpha parameter using a cross validation (a K-fold CV).

Note

the regression implementation is based on scikit-learn. The alpha parameter is called lambda in the R glmnet package. In the ElasticNet case, and additional parameter is called l1_ratio (0.5 by default) and corresponds to the glmnet parameter alpha.

Before, doing the tuning, let us choose one drug. Let us pick up an interesting one:

drugid = 1047

Tuning

Users are able to choose the number of K-folds, which is set to 10 by default. Here is the method to call:

res = lasso.runCV(drugid, kfolds=8)

The returned object contains the best alpha but also the pearson coefficient:

res.alpha

From this parameter, the best model can be created and used for further analysis, in particular to see the important features:

best_model = lasso.get_model(alpha=res.alpha)
weights = lasso.plot_weight(drugid, best_model)

(Source code, png, hires.png, pdf)

_images/regression-1.png

Validation

The runCV function mentionned above does not plot any figures for optimisation reasons. We implemented another function called tune_alpha, which has a visual representation. This is however 20 time slower than runCV and is not used in production.:

lasso.tune_alpha(drugid, alpha_range=[-2.8,-0.5])

(Source code, png, hires.png, pdf)

_images/regression-2.png

Another important function is the check_randomness method. It runs N times the gdsctools.regression.GDSCLAsso.runCV() function, and N times the same analysis shuffling the Y data. This creates a NULL model. The Pearson correlation values between the NULL model and the real data is then compared using a Bayes factor metric (independent of N).

boxplots

boxplots = lasso.boxplot(drugid, model=best_model, n=10, bx_vert=False)

(Source code, png, hires.png, pdf)

_images/regression-3.png

ADAF analysis

We have now a good picture of what the regression tools can do. If one wants to play with ElasticNet or Ridge methods, just replaced GDSCLasso by GDSCElasticNet or GDSCRidge.

We now want to run the regression on all drugs. This can be done manually of course using a loop over each drug identifiers:

for drugID in lasso.drugIds:
    res = lasso.runCV(drugid, kfolds=8)
    best_model = lasso.get_model(alpha=res.alpha)
    weights = lasso.plot_weight(drugid, best_model)
    boxplots = lasso.boxplot(drugid, model=best_model, n=10, bx_vert=False)
    # Save images here

The snakemake pipeline

Warning

This is only available for Python 3.5 users since the snakemake utility is only available for Python 3.

We provide a pipeline in a form of a snakemake file. The pipeline is called regression.rules and a config file named regression.yaml is also provided.

The workflow looks like:

_images/regression.png

Imagine the case where you have 4 drugs, then results and weights are computed for each drug. This is parallelised on a distributed-computer. Once the computation is performed, a report is created.

The path of those files can be obtained using

from gdsctools import gdsctools_data
gdsctools_data("regression.rules", "../pipelines")
gdsctools_data("regression.yaml", "../pipelines")

Those two files must be copied in a local directory.

Then, edit the config file that looks like:

regression:
    method: lasso
    kfold: 10
    randomness: 50

input:
    ic50:
    genomic_features:

so as to set the input IC50 and genomic_features files. Once done, you can run the analysis. Just type:

snakemake -s regression.rules -j 4

Or a cluster, you may add the following information (for instance on a slurm system):

snakemake -s regression.rules -j 40 --cluster "sbatch --qos normal"

where -j 40 means uses 40 cores. Wait until it is finished. You should have an index.html file at the end.

Note

There is a standalone that fetches the pipeline and its config file, autofilled with user’s argument ready to run. The standalone is called gdsctools_regression. Please see Standalone application section

Notebooks

Within GDSCTools, we also provide a set of IPython notebooks. Those notebooks are like recipes and tutorials and reproduce some of the examples available in this documentation. We encourage you to have a look at the IPython and IPython notebook online documentation if you want to create your own notebooks.

Note

IPython is a growing project, with increasingly language-agnostic components. As of IPython 4.0, the language-agnostic parts of the project (e.g., notebook) have moved to new projects under the name Jupyter.

The original notebooks that we provide in GDSCTools are available in the GitHub repository and can visualised via the nbviewer.

Warning

if you installed GDSCTools with pip, it may not be obvious to find the notebooks on your system, in which case we would recommend to download the notebooks form the source code (link above).

Note, however, that to get the best of the those notebooks (interactivity), you should try them locally:

  • install GDSCTools and IPython if not already done.

  • If you installed from source, the notebooks are in ./gdsctools/notebooks, otherwise, download the notebooks from the online repository.

  • Go in the directory where are located the notebooks and type (in a shell):

    ipython notebook
    

    This should start the interactive notebook in a browser and you should see the notebooks. Click on one of them and go through the notebooks.

The notebooks cover many different aspects of the GDSCTools library including the quickstart discussed in this documentation. We will not give more details about their contents since they may evolve and change we time. We encourage you to create new ones and to share them.

Standalone application

Although we would encourage you to use the Python shell to have as much flexibility as possible, we also provide a standalone applications.

Currently, there are two standalones. The gdsctools_anova and gdsctools_regression. The first one is a pure Python implementation while the second one is snakemake-based.

gdsctools_anova application

called gdsctools_anova. This standalone application should be installed with GDSCTools automatically. It focuses on the ANOVA analysis only, and can be used to analysis one set of IC50 and genomic feature at a time.

You can obtain the help by typing:

gdsctools_anova --help

The main goal is to provide an interface to the python library and consequently, one be able to redo the analysis as shown in the quickstart:

* One drug One Feature with figure(s) and HTMLs
* One Drug All Feature with figure and HTMLs
* All Drug All Feature with figures and HTMLs

We suppose the input data file is called IC50_10drugs.tsv

ODOF

gdsctools_anova --input-ic50 IC50_10drugs.tsv --drug
    Drug_999_IC50 --feature TP53_mut --onweb

ODAF

gdsctools_anova --input-ic50 IC50_10drugs.tsv --drug
    Drug_999_IC50 --onweb

ADAF

gdsctools_anova --input-ic50 IC50_10drugs.tsv --onweb

Some other settings

Again, you can use the --help to get up-to-date information about the available arguments. However, let us give a couple of interesting ones.

If you are interesting in a specific association of drug and feature, it is convenient to get the valid drug names:

--print-drug-names

or feature names:

--print-feature-names

By default the analysis is PANCAN (includes all tissues) but you can restrict the analysis to a set of tissues (or just one):

--tissues breast, cervix

To know the names of the tissues, use:

--print-tissue-names

gdsctools_regression application

Let us consider the case where you have an IC50 file and a genomic file. The first step consists in preparing the working directory:

gdsctools_regression -I IC50_v17.csv.gz -F genomic_features_v17.csv.gz
    --method lasso -O lasso_analysis
cd lasso_analysis

On a local computer:

snakemake -s regression.rules -j 4

On a distributed-computing system using e.g SLURM framework, use:

srun --qos normal snakemake -s regression.rules -j 40 --cluster "sbatch --qos normal"

Reproducibility

In this section we show how to reproduce some specific results such as those published in the GDSC1000 paper (see Notes to reproduce v17a results (GDSC1000 Cell paper) section).

The motivation for this section is to show that results can be reproduced using an official release and the proper set of parameters used in the analysis.

Notes to reproduce v17a results (GDSC1000 Cell paper)

Reference:Iorio et al 2016

We refer to the data used in the reference above as the v17a data sets. It can be found on the CancerRxGene page. We will use this data to reproduce the results published in the same web page.

First, we need to download the data. For example, let us retrieve the BRCA tissue specific data set using wget command:

wget http://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources//Data/anova/BRCA/DATA/INPUT/ANOVA_input.txt

Warning

in v17a data, the input file is named ANOVA_input.txt It is a tabulated format and contains all IC50s and genomic features altogether. This is not recommended and we now expect the IC50s and genomic_features to be in two different files. However, for back compatibility, GDSCTools is able to extract the IC50 alone, or the genomic features alone from that kind of format. See hereafter and Data Format and Readers section for details.

Once downloaded, create an ANOVA instance as follows:

from gdsctools import ANOVA, ANOVAReport
an = ANOVA("ANOVA_input.txt", "ANOVA_input.txt")

Here, we provide the filename twice; this is not a mistake, please see the warning box above. Internally, the first argument extracts the IC50s and the second one extracts the genomic features.

In v17a, there are two parameters that are not the current default values that must be set to reproduce the results:

an.settings.pvalue_correction_method = "qvalue"
an.settings.equal_var_ttest = False

The later parameter is used in some plots for annotation but is not essential. The first parameter is important since it sets the method used for multiple testing correction. It will also define the number of associations that are significant. The FDR threshold that defines the significant associations was set to 25.

In the case of a tissue specific analysis (here BRCA), the name of the tissue is unknown (not specified anywhere inside the file) and so one should provide the information. This is not important for the analysis itself but is used for instance to name the output of the directory where HTML reports are stored:

report = ANOVAReport(an)
report.settings.directory = "BLCA"
report.create_html_pages()

Note that the multiple corrected values reported by GDSCTools and found on the website are different by a systematic bais of 2-3 %. This is known and due to a different implementation of the qvalue method (smoothing function). However, the number of tests and the ANOVA_FEATURE_pval column (pvalues of the FEATURE factor) should agree perfectly. Finally, note that because the value of the FDR (corrected values) differ, the number of significant associations below that threshold may also slightly differ. However, results are consitent for FDR not close to the threshold.

As for the PANCAN case, results are currently different between GDSCTools and what is posted within the link at the top of the page because there is currently a mismatch between the ANOVA_input file provided and the results provided (one has the MEDIA factor while there other has not)

For information, on an intel i7 core, the analysis of the PANCAN data set (265 drugs and about 1000 features) takes about 20 minutes to finish. Tissue specific data files takes a few minutes or less in general.

Notes about v18 onwards

In V18 onwards (May 2016), the IC50 may have duplicated columns for a given drug. So some drugs are clustered together. The algorithm was implemented in GDSCTools in IC50Cluster class. It should be used as follow.

IC50 must be read with the IC50Cluster class as follows:

from gdsctools import *
ic50 = IC50Cluster("v18_data")

This will ensure also that the drug identifiers are unique. Indeed, in v18 data sets, columns for a given DRUG ID may be duplicated (for different drug concentration).

Then, as usual:

an = ANOVA(ic50, "GF.csv", "DRUG_DECODE.csv")

All default parameters were used except for the FDR threshold:

report.settings.FDR_threshold = 35

References

Statistical Tools

Code related to the ANOVA analysis to find associations between drug IC50s and genomic features

class MultipleTesting(method=None)[source]

This class eases the computation of multiple testing corrections

The method implemented so far are based on statsmodels or a local implementation of qvalue method.

method name Description
bonferroni one-step correction
sidak one-step correction
holm-sidak step down method using Sidak adjustments
holm step down method using Bonferroni adjustments
simes-hochberg step up method (independent)
hommel close method based on Simes tests (non negative)
fdr_bh FDR Benjamini-Hochberg (non-negative)
fdr_by FDR Benjamini-Yekutieli (negative)
fdr_tsbky FDR 2-stage Benjamini-Krieger-Yekutieli non negative
frd_tsbh FDR 2-stage Benjamini-Hochberg’ non-negative
fdr same as fdr_bh
qvalue see QValue class

See also

gdsctools.qvalue.

Constructor

Parameters:method – default to fdr that is the FDR Benjamini-Hochberg correction.
get_corrected_pvalues(pvalues, method=None)[source]

Return corrected pvalues

Parameters:
  • pvalues (list) – list or array of pvalues to correct.
  • method – use the one defined in the constructor by default but can be overwritten here
method

get/set method

plot_comparison(pvalues, methods=None)[source]

Simple plot to compare the pvalues correction methods

from gdsctools.stats import MultipleTesting
mt = MultipleTesting()
pvalues = [1e-10, 9.5e-2, 2.2e-1, 3.6e-1, 5e-1, 6e-1,8e-1,9.6e-1]
mt.plot_comparison(pvalues,
    methods=['fdr_bh', 'qvalue', 'bonferroni', 'fdr_tsbh'])

(Source code, png, hires.png, pdf)

_images/references-2.png

Note

in that example, the qvalue and FDR are identical, but this is not true in general.

valid_methods = None

set of valid methods

cohens(x, y)[source]

Effect size metric through Cohen’s d metric

Parameters:
  • x – first vector
  • y – second vector
Returns:

absolute effect size value

The Cohen’s effect size d is defined as the difference between two means divided by a standard deviation of the data.

d = \frac{\bar{x}_1 - \bar{x}_2}{s}

For two independent samples, the pooled standard deviation is used instead, which is defined as:

s = \sqrt{  \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2} }

A Cohen’s d is frequently used in estimating sample sizes for statistical testing: a lower d value indicates the necessity of larger sample sizes, and vice versa.

Note

we return the absolute value

References:https://en.wikipedia.org/wiki/Effect_size
signed_effects(df)[source]

Implementation of qvalue estimate

Author: Thomas Cokelaer

class QValue(pv, lambdas=None, pi0=None, df=3, method='smoother', smooth_log_pi0=False, verbose=True)[source]

Compute Q-value for a given set of P-values

Constructor

The q-value of a test measures the proportion of false positives incurred (called the false discovery rate or FDR) when that particular test is called significant.

Parameters:
  • pv – A vector of p-values (only necessary input)
  • lambdas – The value of the tuning parameter to estimate pi_0. Must be in [0,1). Can be a single value or a range of values. If none, the default is a range from 0 to 0.9 with a step size of 0.05 (inluding 0 and 0.9)
  • method – Either “smoother” or “bootstrap”; the method for automatically choosing tuning parameter in the estimation of pi_0, the proportion of true null hypotheses. Only smoother implemented for now.
  • df – Number of degrees-of-freedom to use when estimating pi_0 with a smoother (default to 3 i.e., cubic interpolation.)
  • pi0 (float) – if None, it’s estimated as suggested in Storey and Tibshirani, 2003. May be provided, which is convenient for testing.
  • smooth_log_pi0 – If True and ‘pi0_method’ = “smoother”, pi_0 will be estimated by applying a smoother to a scatterplot of log pi_0 rather than just pi_0

Note

Estimation of pi0 differs slightly from the one given in R (about 0.3%) due to smoothing.spline function differences between R and SciPy.

If no options are selected, then the method used to estimate pi_0 is the smoother method described in Storey and Tibshirani (2003). The bootstrap method is described in Storey, Taylor & Siegmund (2004) but not implemented yet.

estimate_pi0(pi0)[source]

Estimate pi0 based on the pvalues

qvalue()[source]

Return the qvalues using pvalues stored in pv attribute

Readers

IC50, Genomic Features, Drug Decode

IO functionalities

Provides readers to read the following formats

class IC50(filename, v18=False)[source]

Reader of IC50 data set

This input matrix must be a comman-separated value (CSV) or tab-separated value file (TSV).

The matrix must have a header and at least 2 columns. If the number of rows is not sufficient, analysis may not be possible.

The header must have a column called “COSMIC_ID” or “COSMIC ID”. This column will be used as indices (row names). All other columns will be considered as input data.

The column “COSMIC_ID” contains the cosmic identifiers (cell line). The other columns should be filled with the IC50s corresponding to a pair of COSMIC identifiers and Drug. Nothing prevents you to fill the file with data that have other meaning (e.g. AUC).

If at least one column starts with Drug_, all other columns will be ignored. This was implemented for back compatibility.

The order of the columns is not important.

Here is a simple example of a valid TSV file:

COSMIC_ID   Drug_1_IC50 Drug_20_IC50
111111      0.5         0.8
222222      1           2

A test file is provided in the gdsctools package:

from gdsctools import ic50_test

You can read it using this class and plot information as follows:

from gdsctools import IC50, ic50_test
r = IC50(ic50_test)
r.plot_ic50_count()

(Source code, png, hires.png, pdf)

_images/references-3.png

You can get basic information using the print function:

>>> from gdsctools import IC50, ic50_test
>>> r = IC50(ic50_test)
>>> print(r)
Number of drugs: 11
Number of cell lines: 988
Percentage of NA 0.206569746043

You can get the drug identifiers as follows:

r.drugIds

and set the drugs, which means other will be removed:

r.drugsIds = [1, 1000]

Changed in version 0.9.10: The column COSMIC ID should now be COSMIC_ID. Previous name is deprecated but still accepted.

Constructor

Parameters:filename – input filename of IC50s. May also be an instance of IC50 or a valid dataframe. The data is stored as a dataframe in the attribute called df. Input file may be gzipped
copy()[source]
cosmic_name = 'COSMIC_ID'
drop_drugs(drugs)[source]

drop a drug or a list of drugs

drugIds

list the drug identifier name or select sub set

drug_name_to_int(name)[source]
get_ic50()[source]

Return all ic50 as a list

hist(bins=20, **kargs)[source]

Histogram of the measured IC50

Parameters:
  • bins – binning of the histogram
  • kargs – any argument accepted by pylab.hist function.
Returns:

all measured IC50

from gdsctools import IC50, ic50_test
r = IC50(ic50_test)
r.hist()

(Source code, png, hires.png, pdf)

_images/references-4.png
plot_ic50_count(**kargs)[source]

Plots the fraction of valid/measured IC50 per drug

Parameters:kargs – any valid parameters accepted by pylab.plot function.
Returns:the fraction of valid/measured IC50 per drug
class GenomicFeatures(filename=None, empty_tissue_name='UNDEFINED')[source]

Read Matrix with Genomic Features

These are the compulsary column names required (note the spaces):

  • ‘COSMIC_ID’
  • ‘TISSUE_FACTOR’
  • ‘MSI_FACTOR’

If one of the following column is found, it is removed (deprecated):

- 'SAMPLE_NAME'
- 'Sample Name'
- 'CELL_LINE'

and features can be also encoded with the following convention:

  • columns ending in “_mut” to encode a gene mutation (e.g., BRAF_mut)
  • columns starting with “gain_cna”
  • columns starting with “loss_cna”

Those columns will be removed:

  • starting with Drug_, which are supposibly from the IC50 matrix
>>> from gdsctools import GenomicFeatures
>>> gf = GenomicFeatures()
>>> print(gf)
Genomic features distribution
Number of unique tissues 27
Number of unique features 677 with
- Mutation: 270
- CNA (gain): 116
- CNA (loss): 291

Changed in version 0.9.10: The header’s columns’ names have changed to be more consistant. Previous names are deprecated but still accepted.

Changed in version 0.9.15: If a tissue is empty, it is replaced by UNDEFINED. We also strip the spaces to make sure there is “THIS” and “THIS ” are the same.

Constructor

If no file is provided, using the default file provided in the package that is made of 1001 cell lines times 680 features.

Parameters:empty_tissue_name (str) – if a tissue name is let empty, replace it with this string.
colnames = {'cosmic': 'COSMIC_ID', 'media': 'MEDIA_FACTOR', 'msi': 'MSI_FACTOR', 'tissue': 'TISSUE_FACTOR'}
compress_identical_features()[source]

Merge duplicated columns/features

Columns duplicated are merged as follows. Fhe first column is kept, others are dropped but to keep track of those dropped, the column name is renamed by concatenating the columns’s names. The separator is a double underscore.

gf = GenomicFeatures()
gf.compress_identical_features()
# You can now access to the column as follows (arbitrary example)
gf.df['ARHGAP26_mut__G3BP2_mut']
drop_tissue_in(tissues)[source]

Drop tissues from the list

Parameters:tissues (list) – a list of tissues to drop. If you have only one tissue, can be provided as a string. Since rows are removed some features (columns) may now be empty (all zeros). If so, those columns are dropped (except for the special columns (e.g, MSI).
features

return list of features

fill_media_factor()[source]

Given the COSMIC identifiers, fills the MEDIA_FACTOR column

If already populated, replaced by new content.

get_TCGA()[source]
keep_tissue_in(tissues)[source]

Drop tissues not in the list

Parameters:tissues (list) – a list of tissues to keep. If you have only one tissue, can be provided as a string. Since rows are removed some features (columns) may now be empty (all zeros). If so, those columns are dropped (except for the special columns (e.g, MSI).
plot(shadow=True, explode=True, fontsize=12)[source]

Histogram of the tissues found

from gdsctools import GenomicFeatures
gf = GenomicFeatures() # use the default file
gf.plot()

(Source code)

shift
tissues

return list of tissues

unique_tissues

return set of tissues

class Reader(data=None)[source]

Convenience base class to read CSV or TSV files (using extension)

Constructor

This class takes only one input parameter, however, it may be a filename, or a dataframe or an instance of Reader itself. This means than children classes such as IC50 can also be used as input as long as a dataframe named df can be found.

Parameters:data – a filename in CSV or TSV format with format specified by child class (see e.g. IC50), or a valid dataframe, or an instance of Reader.

The input can be a filename either in CSV (comma separated values) or TSV (tabular separated values). The extension will be used to interpret the content, so please be consistent in the naming of the file extensions.

>>> from gdsctools import Reader, ic50_test
>>> r = Reader(ic50_test.filename) # this is a CSV file
>>> len(r.df)   # number of rows
988
>>> len(r)      # number of elements
11856

Note that Reader is a base class and more sophisticated readers are available. for example, the IC50 would be better to read this IC50 data set.

The data has been stored in a data frame in the df attribute.

The dataframe of the object itself can be used as an input to create an new instance:

>>> from gdsctools import Reader, ic50_test
>>> r = Reader(ic50_test.filename, sep="\t")
>>> r2 = Reader(r) # here r.df is simply copied into r2
>>> r == r2
True

It is sometimes convenient to create an empty Reader that will be populated later on:

>>> r = Reader()
>>> len(r)
0

More advanced readers (e.g. IC50) can also be used as input as long as they have a df attribute:

>>> from gdsctools import Reader, ic50_test
>>> ic = IC50(ic50_test)
>>> r = Reader(ic)
check()[source]

Checking the format of the matrix

Currently, only checks that there is no duplicated column names

header = None

if populated, can be used to check validity of a header

read_data(filename)[source]
to_csv(filename, sep=', ', index=False, reset_index=True)[source]

Save data into a CSV file without indices

class DrugDecode(filename=None)[source]

Reads a “drug decode” file

The format must be comma-separated file. There are 3 compulsary columns called DRUG_ID, DRUG_NAME and DRUG_TARGET. Here is an example:

DRUG_ID     ,DRUG_NAME   ,DRUG_TARGET
999         ,Erlotinib   ,EGFR
1039        ,SL 0101-1   ,"RSK, AURKB, PIM3"

TSV file may also work out of the box. If a column name called ‘PUTATIVE_TARGET’ is found, it is renamed ‘DRUG_TARGET’ to be compatible with earlier formats.

In addition, 3 extra columns may be provided:

- PUBCHEM_ID
- WEBRELEASE
- OWNED_BY

The OWNED_BY and WEBRELEASE may be required to create packages for each company. If those columns are not provided, the internal dataframe is filled with None.

Note that older version of identifiers such as:

Drug_950_IC50

are transformed as proper ID that is (in this case), just the number:

950

Then, the data is accessible as a dataframe, the index being the DRUG_ID column:

data = DrugDecode('DRUG_DECODE.csv')
data.df.iloc[999]

Note

the DRUG_ID column must be made of integer

Constructor

check()[source]
companies
drugIds

return list of drug identifiers

drug_annotations(df)[source]

Populate the drug_name and drug_target field if possible

Parameters:df – input dataframe as given by e.g., anova_one_drug()
Return df:same as input but with the FDR column populated
drug_names
drug_targets
get_info()[source]
get_name(drug_id)[source]
get_public_and_one_company(company)[source]

Return drugs that belong to a specific company and public drugs

get_target(drug_id)[source]
is_public(drug_id)[source]

OmniBEM related

OmniBEM functions

class OmniBEMBuilder(genomic_alteration)[source]

Utility to create GenomicFeatures instance from BEM data

Starting from an example provided in GDSCTools (test_omnibem_genomic_alteration.csv.gz), here is the code to get a data structure compatible with the GenomicFeature, which can then be used as input to the ANOVA class.

See the constructor for the header format.

from gdsctools import gdsctools_data, OmniBEMBuilder, GenomicFeatures

input_data = gdsctools_data("test_omnibem_genomic_alterations.csv.gz")
bem = OmniBEMBuilder(input_data)

# You may filter the data for instance to keep only a set of genes.
# Here, we keep everything
gene_list = bem.df.GENE.unique()
bem.filter_by_gene_list(gene_list)

# Finally, create a MoBEM dataframe
mobem = bem.get_mobem()

# You may filter with other functions(e.g., to keep only Methylation
# features
bem.filter_by_type_list(["Methylation"])
mobem = bem.get_mobem()

# Then, let us create a dataframe that is compatible with
# GenomicFeature. We just need to make sure the columns are correct
mobem[[x for x in mobem.columns if x!="SAMPLE"]]
gf = GenomicFeatures(mobem[[x for x in mobem.columns if x!="SAMPLE"]])

# Now, we can perform an ANOVA analysis:
from gdsctools import ANOVA, ic50_test
an = ANOVA(ic50_test, gf)
results = an.anova_all()
results.volcano()

(Source code, png, hires.png, pdf)

_images/references-6.png

Note

The underlying data is stored in the attribute df.

Constructor

Parameters:genomic_alteration (str) – a filename in CSV format (gzipped or not). The format is explained here below.

The input must be a 5-columns CSV file with the following columns:

COSMIC_ID: an integer
TISSUE_TYPE: e.g. Methylation,
SAMPLE: this should correspond to the COSMID_ID
TYPE: Methylation,
GENE: gene name
IDENTIFIER: required for now but may be removed (rows can be
    used as identifier indeed

Warning

If GENE is set to NA, we drop it. In the resources shown in the example here above, this corresponds to 380 rows (out of 60703). Similarly, if TISSUE is NA, we also drop these rows that is about 3000 rows.

filter_by_cosmic_list(cosmic_list)[source]

Filter the data by cosmic identifers

Parameters:cosmic (list) – the cosmic identifiers to be kept. The data is updated in place.
filter_by_gene_list(genes, minimum_gene=3)[source]

Keeps only required genes

Parameters:
  • genes – a list of genes or a filename (a CSV file with a column named GENE).
  • minimum_gene – genes with not enough entries are removed (defaults to 3)
filter_by_sample_list(sample_list)[source]

Filter the data by sample name

Parameters:tissue (list) – the samples to be kept. The data is updated in place.
filter_by_tissue_list(tissue_list)[source]

Filter the data by tissue

Parameters:tissue (list) – the tissues to be kept. The data is update in place.
filter_by_type_list(type_list)[source]

Filter the data by type

Parameters:tissue (list) – the types to be kept. The data is update in place. Here are some examples of types: Point.mutation, Amplification, Deletion, Methylation.
get_genomic_features()[source]
get_mobem()[source]

Return a dataframe compatible with ANOVA analysis

The returned object can be read by GenomicFeatures.

get_significant_genes(N=20)[source]

Return most present genes

Parameters:N (int) – the maximum number of genes to return
Returns:list of genes with the number of occurences

The genes returned by be used to filter the data:

genes = bem.get_significant_genes(N=20)
bem.filter_by_gene_list(genes.index)
plot_alterations_per_cellline(fontsize=10, width=0.9)[source]

Plot number of alterations

from gdsctools import *
data = gdsctools_data("test_omnibem_genomic_alterations.csv.gz")
bem = OmniBEMBuilder(data)
bem.filter_by_gene_list(gdsctools_data("test_omnibem_genes.txt"))
bem.plot_alterations_per_cellline()

(Source code, png, hires.png, pdf)

_images/references-7.png
plot_number_alteration_by_tissue(fontsize=10, width=0.9)[source]

Plot number of alterations

from gdsctools import *
data = gdsctools_data("test_omnibem_genomic_alterations.csv.gz")
bem = OmniBEMBuilder(data)
bem.filter_by_gene_list(gdsctools_data("test_omnibem_genes.txt"))
bem.plot_number_alteration_by_tissue()

(Source code, png, hires.png, pdf)

_images/references-8.png

This module provides an interface to download and filter GDSC1000 data, as published by Iorio et al (2016).

Author: Elisabeth D. Chen Date: 2016-06-17

class GDSC1000(annotate=False, data_folder_name='./data/gdsc_1000_data/')[source]

Help data retrieval from GDSC website (version v17)

from gdsctools import GDSC1000
data = GDSC1000()
data.download_data()

Next time, starting GDSCTools in the same directory, just type:

data = GDSCTools()
data.load_data()

CNA (e.g., cna_df), Methylation and gene variants are downloaded. IC50 as well and a set of annotations. There are combined in genomic_df.

The genomic_df contains the cna_df, methylation_df and variant_df data. The three latter contains in common the CELL_LINE, ALTERATION_TYPE, IDENTIFIER and TISSUE_FACTOR.

ALTERATION_TYPE can be GENETIC_VARIATION, METHYLATION, DELETION / AMPLIFICATION variant_df also has the COSMIC_ID. The annotation_df gives mapping between identifiers and gene names.

download_data() Download and load data in memory
load_data([annotation]) If CSV files are already downloaded, just load them
make_matrix([min_recurrence]) Return a dataframe compatible with ANOVA analysis

With this class, you may (1) download the data, (2) annotate or filter the data and (3) create a genomic matrix.

To load the data, either download it from the website using download_data() (downloads data from cancerrxgene.org loading methylation, cna, variant and cell lines datasets). It extract relevant columns, re-names column names and saves as csv files in the data_folder_name directory.

If you have already downloaded the data, you may just load it using load_data(). This function also annotates the data with gene information.

Then, you filter the data with one of the filter methods:

filter_by_cell_line([ "AsPC-1", "U-2-OS", "MDA-MB-231"... ] )
filter_by_cosmic_id([ 948121, 2839818, ...] )
filter_by_gene(["ATM", "TP53", ...])
filter_by_recurrence(min_recurrence = 3 )
filter_by_tissue_type([ "BRCA", "COAD/READ", ... ] )
filter_by_alteration_type(["METHYLATION", "AMPLIFICATION",
                                "DELETION", "GENE_VARIANT" ] )

Finally, you can create the genomic matrix using make_matrix().

You can also look at the unique regions for agiven data set. For instance, methylation dataframe has about 2,000 regions but only 338 are unique:

len(data.methylation_df.groupby('IDENTIFIER').groups)

constructor

Parameters:
  • annotate (bool) –
  • data_folder_name (str) –
annotate_all()[source]

Annotates dataframes after read_annotation

download_data()[source]

Download and load data in memory

filter_by_alteration_type(type_list=None)[source]
filter_by_cell_line(cell_line_list=None)[source]
filter_by_cosmic_id(cosmic_list=None)[source]
filter_by_gene(gene_list=None)[source]
filter_by_recurrence(min_recurrence=None)[source]
filter_by_tissue_type(tissue_list=None)[source]
get_cna_info()[source]

Return dataframe with number of genes per unique CNA identifier

get_genomic_info()[source]

Return information about the genomic dataframe

The returned dataframes contains two columns: the number of unique cosmic identifiers and features for each type of alterations.

get_methylation_info()[source]

Return dataframe with number of genes per unique methylation identifier

load_data(annotation=True)[source]

If CSV files are already downloaded, just load them

make_matrix(min_recurrence=None)[source]

Return a dataframe compatible with ANOVA analysis

reset_genomic_data()[source]

Visualisation

Volcano plot

Volcano plot utilities

The VolcanoANOVA is used in the creation of the report but may be used by a usr in a Python shell.

class VolcanoANOVA(data, sep='t', settings=None)[source]

Utilities related to volcano plots

This class is used in gdsctools.anova but can also be used independently as in the example below.

from gdsctools import ANOVA, ic50_test, VolcanoANOVA
an = ANOVA(ic50_test)

# retrict analysis to a tissue to speed up computation
an.set_cancer_type('lung_NSCLC')

# Perform the entire analysis
results = an.anova_all()

# Plot volcano plot of pvalues versus signed effect size
v = VolcanoANOVA(results)
v.volcano_plot_all()

(Source code, png, hires.png, pdf)

_images/references-9.png

Note

Within an IPython shell, you should be able to click on a circle and the title will be updated with the name of the drug/feature and FDR value.

Legend and color conventions:
 The green circles indicate significant hits that are resistant while reds show sensitive hits. Circles are colored if there are below the FDR_threshold AND below the pvalue_threshold AND if the signed effect size is above the effect_threshold.

Constructor

Parameters:
  • data – an ANOVAResults instance or a dataframe with the proper columns names (see below)
  • settings – an instance of ANOVASettings

Expected column names to be found if a filename is provided:

ANOVA_FEATURE_pval
ANOVA_FEATURE_FDR
FEATURE_delta_MEAN_IC50
FEATURE_IC50_effect_size
N_FEATURE_pos
N_FEATURE_pos
FEATURE
DRUG_ID

If the plotting is too slow, you can use the selector() to prune the results (most of the data are noise and overlap on the middle bottom area of the plot with little information.

savefig(filename, size_inches=(10, 10))[source]
selector(df, Nbest=1000, Nrandom=1000, inplace=False)[source]

Select only the first N best rows and N random ones

Sometimes, there are tens of thousands of associations and future analysis will include more features and drugs. Plotting volcano plots should therefore be fast and scalable. Here, we provide a naive way of speeding up the plotting by selecting only a subset of the data made of Nbest+Nrandom associations.

Parameters:
  • df – the input dataframe with ANOVAResults
  • Nbest (int) – how many of the most significant association should be kept
  • Nrandom (int) – on top of the Nbest significant association, set how many other randomly chosen associations are to be kept.
Returns:

pruned dataframe

volcano_plot_all()[source]

Create an overall volcano plot for all associations

This method saves the picture in a PNG file named volcano_all.png.

volcano_plot_all_drugs()[source]

Create a volcano plot for each drug and save in PNG files

Each filename is set to volcano_<drug identifier>.png

volcano_plot_all_features()[source]

Create a volcano plot for each feature and save in PNG files

Each filename is set to volcano_<feature name>.png

volcano_plot_one_drug(drug_id)[source]

Volcano plot for one drug (all genomic features)

Parameters:drug_id – a valid drug identifier to be found in the results
volcano_plot_one_feature(feature)[source]

Volcano plot for one feature (all drugs)

Parameters:feature – a valid feature name to be found in the results
class VolcanoANOVAJS(data, sep='t', settings=None)[source]
get_fdr_ypos()[source]
render_all()[source]
render_drug(name)[source]
render_feature(name)[source]

Boxplot and beeswarm

boxswarm(data, names=None, vert=True, widths=0.5, **kwargs)[source]

Plot boxplot with all points as circles.

This function is a wrapper of BoxSwarm

Parameters:
  • data – a dataframe. Each column is a data set from which a boxplot is created.
  • names
  • vert – orientation of the boxplots
  • widths – widths of the boxes
  • kargs – any argument accepted by BoxSwarm

See BoxSwarm documentation for details

class BoxSwarm(data, names=None, fontsize=20, hold=False, title='', lw=2, colors=['lightgrey', 'blue'])[source]

Simple beeswarm plot (boxplot + dots for each data point)

from pylab import randn
from gdsctools.boxswarm import BoxSwarm
b = BoxSwarm({'a':randn(100), 'b':randn(20)+2})
b.plot(vert=False)

(Source code, png, hires.png, pdf)

_images/references-10.png

Note

could use pybeeswarm, which is a proper implementation of beeswarm.

Constructor

Param:

a list of list (not same size) or a dictionary of lists

Parameters:
  • data
  • names
  • fontsize
  • hold
  • title
  • lw – width of lines
  • colors – loop over the list of colors provided to fill boxplots
  • **kargs
beeswarm(data, position, ratio=2.0)[source]

Naive plotting of the data points

We assume gaussian distribution so we expect fewers dots far from the mean/median. We’d like those dots to be close to the axes. conversely, we expect lots of dots centered around the mean, in which case, we’d like them to be spread in the box. We uniformly distribute position using

X = X + \dfrac{ U()-0.5 }{ratio} \times factor

but the factor is based on an arctan function:

factor = 1 - \arctan( \dfrac{X - \mu }{\pi/2})

The farther the data is from the mean \mu, the closest it is to the axes that goes through the box.

plot(vert=True, alpha=0.4, widths=0.5, **kwargs)[source]

Plot the boxplots and dots

Code related to the ANOVA analysis to find associations between drug IC50s and genomic features

class BoxPlots(odof, fontsize=20, savefig=False, directory='.')[source]

Box plot for a given association of drug versus genomic feature

from gdsctools import ANOVA, ic50_test
from gdsctools.boxplots import BoxPlots

gdsc = ANOVA(ic50_test)

# Perform the entire analysis
odof = gdsc._get_one_drug_one_feature_data(1047, 'TP53_mut')

# Plot volcano plot of pvalues versus signed effect size
bx = BoxPlots(odof)
bx.boxplot_association()

(Source code, png, hires.png, pdf)

_images/references-11.png

If the gdsc analysis has the MSI factor and tissue factor on, then additional plots can be created using boxplot_pancan().

Note that odof in the example above is a dictionary with the following keys:

  • drug_name
  • feature_name
  • masked_tissue: a dataframe with cosmic ids as index and 1 column of tissues names.
  • Y: list with the IC50s
  • masked_features: a dataframe with cosmic ids as index and 1 column of masked feature (1/0)
  • masked_msi: same as masked_features
  • negatives: subset of the IC50s corresponding to positive feature
  • positives: subst of the IC50s corresponding to negative feature

Constructor

boxplot_association(fignum=1)[source]

Boxplot of the association (negative versus positive)

Parameters:fignum – number of the figure
boxplot_pancan(mode, fignum=1, title_prefix='')[source]

Create boxplot related to the MSI factor or Tissue factor

Parameters:mode – either set to msi or tissue
directory = None

directory where to save the figure.

fontsize = None

fontsize for the plots

lw = None

linewidth used in the plots

odof = None

dictionary as returned by ANOVA._get_one_drug_one_feature_data

savefig = None

boolean to save figure

class BoxPlotsJS(odof, fontsize=20, savefig=False, directory='.')[source]
get_html_association()[source]
get_html_media()[source]
get_html_msi()[source]
get_html_tissue()[source]

reports

Base classes to create HTML reports easily

class ReportMain(filename='index.html', directory='report', overwrite=True, verbose=True, template_filename='index.html', mode=None, init_report=True)[source]

A base class to create HTML pages

This Report class holds the CSS and HTML layout and will ease the creation of new reports and HTML pages. For instance, it will add a footer and header automatically, save files in the proper directory, create the directory if it is missing, copy CSS and JS files in the directory automatically.

from gdsctools import Report
r = Report()
r.add_section('Example with some text', 'Example' )
r.report(onweb=True)

Note

For developers the original CSS and JS files are stored in the share/data directory.

The idea is that you create sections (text + title) that you add little by little in your HTML documents. Then, you create the report. The report will add a footer, header, table of contents before the sections. The text of a section can contain any HTML document.

Constructor

Parameters:
  • filename – default to index.html
  • directory – defaults to report
  • overwrite – default to True
  • verbose – default to True
  • dependencies – add the dependencies table at the end of the document if True.
  • mode (str) – if none, report have a structure that contains the following directories: OUTPUT, INPUT, js, css, images, code. Otherwise, if mode is set to ‘summary’, only the following directories are created: js, css, images
create_report(onweb=True)[source]
show()[source]

Opens a tab in a browser to see the document

Data-related

class TCGA[source]

A dictionary-like container to access to TCGA cancer types keywords

>>> from gdsctools import TCGA
>>> tt = TCGA()
>>> tt.ACC
'Adrenocortical Carcinoma'
TCGA_GDSC1000 = ['BLCA', 'BRCA', 'COREAD', 'DLBC', 'ESCA', 'GBM', 'HNSC', 'KIRC', 'LAML', 'LGG', 'LIHC', 'LUAD', 'LUSC', 'OV', 'PAAD', 'SKCM', 'STAD', 'THCA']

TCGA keys used in GDSC1000

class Tissues[source]

List of tissues included in various analysis

Contains tissues included e.g in v17,v18

Data sets provided with GDSCTools

The datasets may be for testing purpose:

  • ic50_test
  • drug_test
  • cosmic_builder_test

or informative:

  • cancer_cell_lines
  • cosmic_info

or used in analysis:

  • genomic_features_test
  • ic50_v17: IC50s for 1001 cell lines
  • gf_v17: dataset with genomic features for 1001 cell lines and 680 features (mutation, CNA)
  • ic50_v5
  • gf_v5
class Data(filename=None, description='No description', authors='GDSC consortium')[source]

A convenience class to hold information about a dataset

Each Data instance contains information about :

  1. the file location (filename)
  2. the data description (description)
  3. the authors (authors)

But the data is not stored and users must read the data set using their own tools.

authors = None

list of authors (string)

description = None

a short description (string)

filename = None

where is located the data set (full path)

location
class COSMICFetcher(filename=None)[source]

Utility to download a flat file about cosmic identier and build a small dataframe with cosmic identifiers and their diseases

The original flat file is downloaded from ftp.expasy.org/databases and contains records as follows:

ID         Identifier (cell line name)     Once; starts an entry
AC         Accession (CVCL_xxxx)           Once
SY         Synonyms                        Optional; once
DR         Cross-references                Optional; once or more
RX         References identifiers          Optional: once or more
WW         Web pages                       Optional; once or more
CC         Comments                        Optional; once or more
DI         Diseases                        Optional; once or more
OX         Species of origin               Once or more
HI         Hierarchy                       Optional; once or more
OI         Originate from same individual  Optional; once or more
SX         Sex (gender) of cell            Optional; once
CA         Category                        Optional; once

We keep only records with COSMIC cross references. From those records, we keep ID, AC, CA, DI (Disease) and the cosmic identifier.

The resulting dataframe can then be accessed in the df attribute.

>>> from gdsctools.cosmictools import COSMICFetcher
>>> cf = COSMICFetcher() # this may take a while to download the file
>>> cf.df.loc[0]
ID                                       OS-A
AC                                  CVCL_0C23
CA                           Cancer cell line
COSMIC_ID                             2239090
Disease      C4917; Small cell lung carcinoma
Name: 0, dtype: object

Constructor

Parameters:filename (str) – If not provided, download file from expasy.org and store it in data. Otherwise, if filename is provided, reads a local file. Format should be the same as the file downloaded from expasy
class COSMICInfo[source]

Retrieve information about cell line included in GDSC1000

This file reads a GDSCTools dataset gdsctools.datasets.cosmic_info. Its content is stored in df.

In corresponds to Table S1E (List cell line samples with data availability and annotations across the different omics

The method get() retrieves information contained in the dataframe df. Provide a known cosmic identifier as follows:

>>> from gdsctools import COSMICInfo
>>> c = COSMICInfo()
>>> c.get(909907, 'SAMPLE_NAME')
'ZR-75-30'

or get all available field as follows:

>>> c.get(909907)
SAMPLE_NAME           ZR-75-30
SEQ                          1
CNA                          1
EXP                          1
MET                          1
DRUG_SCR                     1
GDSC_description_1      breast
GDSC_description_2      breast
Study_Abbreviation        BRCA
MMR                      MSI-L
SCREEN_MEDIUM                R
GROWTH_PROPERTIES     Adherent
Name: 909907, dtype: object

Note

there are only 1000 cell lines in the df. Additional cell lines may be retrieve using COSMICFetcher

If a cosmic identifier is not found, the returned object has the same structure as above but with all fields set to False.

constructor

df = None

dataframe with all information

get(identifier, colname=None)[source]
Parameters:
  • identifier (int) – a cosmic identifiers. Possible values are stored in df.index attribute
  • colname – specific field.
Returns:

if colname is not provided, returns a time series for the identifier with all available fields. Otherwise, returns a specific field.

on_web(identifier)[source]

Open a tab related to the COSMIC identifier (in your browser)

Logistics

Sets of miscellaneous tools

class Logistic(xmid, scale, Asym=1, N=9, increase=False)[source]

Simple logistic class to see the curve implied by xmid/scale parameters

from gdsctools.logistics import Logistic
from pylab import legend
tl = Logistic(2, 1)
tl.plot()
tl.scale = 4
tl.plot(hold=True)
legend(['scale=1', 'scale=4'])

(Source code, png, hires.png, pdf)

_images/references-12.png

The X values are set automatically given the number of data points N and the minimum and maximum X values. By default a sensible values for the minimun and maximum x values are guessed based on scale parameter but one can set the xmin and xmax

from gdsctools.logistics import Logistic
from pylab import legend
tl = Logistic(2,1)
tl.plot()
tl.xmin= -4
tl.xmax = 4
tl.N = 40
tl.plot(hold=True)
legend(['default', 'user defined'])

(Source code, png, hires.png, pdf)

_images/references-13.png

Constructor

Parameters:
  • xmid – the first logistic function parameter
  • scale – the second logistic function parameter
  • Asym – Amplitude of the function
  • N – number of data point
  • increase – increasing or decreasing function.
  • xmin – starting range of x-values
  • xmax – ending range of x-values

if increase is False:

L(x) = \frac{Asym}{1 + exp ((xmid-X)/scale)}

if increase is True

L(x) = 1 - \frac{Asym}{1 + exp ((xmid-X)/scale)}

Changing xmin, xmax or N does change the content of X. You can change X directly.

N

set/get the number points

X

get/set of the x-values

Y

Getter for Y-values

plot(hold=False)[source]

Plot the logistic function

scale
xmax

set/get the maximum x range

xmin

set/get the minimum x range

class LogisticMatchedFiltering(xmid, scale, N=9)[source]

Experimental class to identify parameters of a noisy Logistic function

This class implements two methods to identify the parameters (xmid and scale) of a logistic function.

Note One constraint is to define the x-values.

from gdsctools import logistics
import pylab

mf = logistics.LogisticMatchedFiltering(1,2)
mf.scan(pylab.linspace(-3,3, 10), pylab.linspace(0,5,10))

(Source code)

constructor

Parameters:
  • xmid
  • scale
  • N
N
X

get/set the X values the logistic signal

get_snr(template, show=True)[source]
objective_function(xmid, scale)[source]
optimise(guess=[1, 1])[source]
plot(results, coords, best, xmid_range, scale_range)[source]
scale

get/set the scale value of the logistic signal

scan(xmid_range, scale_range, method='mf', show=True)[source]
set_noisy_data(sigma=0.1)[source]
xmid

get/set the xmid value of the logistic signal

Regression Analysis

Common Regression class

class Regression(ic50, genomic_features=None, verbose=False)[source]

Base class for all Regression analysis

In the gdsctools.anova.ANOVA case, the regression is based on the OLS method and is computed for a given drug and a given feature (ODOF). Then, the analysis is repeated across all features for a given drug (ODAF) and finally extended to all drugs (ADAF). So, there is one test for each combination of drug and feature.

Here, all features for a given drug are taken together to perform a Regression analysis (ODAF). The regression algorithm implemented so far are:

  • Ridge
  • Lasso
  • ElasticNet
  • LassoLars

Based on tools from the scikit-learn library.

Constructor

Parameters:
  • ic50 – an IC50 file
  • genomic_features – a genomic feature file

see Data Format and Readers for help on the input data formats.

boxplot(drug_name, model, n=5, minimum_match_per_combo=10, bx_vert=True, bx_alpha=0.5, verbose=False, max_label_length=40)[source]

Boxplot plot of the most important features.

Parameters:
  • n (int) – maximum most important features to be shown (in term of coefficient/weight used in the model)
  • minimum_match_per_combo (int) – minimum number of alterations required for a feature to be shown
# assuming there is a drug ID = 29
r = GDSCLasso()
model = r.get_best_model(29)
r.boxplot(29, model)

In addition, we also show the wild type case where the total number of mutations is 0.

check_randomness(drug_name, kfolds=10, N=10, progress=False, nbins=40, show=True, **kargs)[source]

Compute Bayes factor between NULL model and best model fitted N times

Parameters:
  • drug_name
  • kfolds
  • N (int) – optimise NULL models and real model N times
  • progress
  • nbins
  • show

Bayes factor:

S = sum([s>r for s,r in zip(scores, random_scores)])
proba = S / len(scores)
bayes_factor = 1. / (1-proba)

Interpretation for values of the Bayes factor according to Kass and Raftery (1995).

Interpretation B(1,2)
Very strong support for 1 < 0.0067
Strong support 1 0.0067 to 0.05
Positive support for 1 0.05 to .33
Weak support for 1 0.33 to 1
No support for either model 1
Weak support for 2 1 to 3
Positive support for 2 3 to 20
Strong support for 2 20 to 150
Very strong support for 2 > 150
references: http://www.socsci.uci.edu/~mdlee/LodewyckxEtAl2009.pdf
http://www.aarondefazio.com/adefazio-bayesfactor-guide.pdf
dendogram_coefficients(stacked=False, show=True, cmap='terrain')[source]

shows the coefficient of each optimised model for each drug

This works for demonstration and small data sets.

fit(drug_name, alpha=1, l1_ratio=0.5, kfolds=10, show=True, tol=0.001, normalize=False, shuffle=False, perturbation=0.01, randomize_Y=False)[source]

Run Elastic Net with a cross validation for one value of alpha

Parameters:
  • drug_name – the drug to analyse
  • alpha (float) – note that theis alpha parameter corresponds to the lambda parameter in glmnet R package.
  • l1_ratio (float) – This is the lasso penalty parameter. Note that in scikit-learn, the l1_ratio correspond to the alpha parameter in glmnet R package. l1_ratio set to 0.5 means that there is a weight equivalent for the Lasso and Ridge effects.
  • kfolds (int) – defaults to 10
  • shuffle – shuffle the indices in the KFold
Returns:

kfolds scores for each fold. The score is the pearson correlation.

Note

l1_ratio < 0.01 is not reliable unless sequence of alpha is provided.

Note

alpha = 0 correspond to an OLS analysis

get_best_model(drug_name, kfolds=10, alphas=None, l1_ratio=0.5)[source]

Return best model fitted using a CV

Parameters:
  • drug_name
  • kfolds
  • alphas
  • l1_ratio
plot_importance(drug_name, model=None, fontsize=11, max_label_length=35, orientation='vertical')[source]

Plot the absolute weights found by a fittd model.

Parameters:
  • drug_name (str) –
  • model – a model
  • fontsize (int) – (defaults to 11)
  • max_label_length – 35 by default
  • orientation – orientation of the plot (vertical or horizontal)
Returns:

the dataframe with the weights (may be empty)

Note

if no weights are different from zeros, no plots are created.

plot_weight(drug_name, model=None, fontsize=12, figsize=(10, 7), max_label_length=20, Nmax=40)[source]

Plot the elastic net weights

Parameters:
  • drug_name – the drug identifier
  • alpha
  • l1_ratio

Large alpha values will have a more stringent effects on the weigths and select only some of them or maybe none. Conversely, setting alphas to zero will keep all weights.

from gdsctools import *
ic = IC50(gdsctools_data("IC50_v5.csv.gz"))
gf = GenomicFeatures(gdsctools_data("genomic_features_v5.csv.gz"))
en = GDSCElasticNet(ic, gf)
model = en.get_model(alpha=0.01)
en.plot_weight(1047, model=model)

(Source code, png, hires.png, pdf)

_images/references-15.png
runCV(drug_name, l1_ratio=0.5, alphas=None, kfolds=10, verbose=True, shuffle=True, randomize_Y=False, **kargs)[source]

Perform the Cross validation to get the best alpha parameter.

Returns:an instance of RegressionCVResults that contains alpha parameter and Pearson correlation value.
tune_alpha(drug_name, alphas=None, N=80, l1_ratio=0.5, kfolds=10, show=True, shuffle=False, alpha_range=[-2.8, 0.1], randomize_Y=False)[source]

Interactive tuning of the model (alpha).

This is much faster than plot_cindex() but much slower than runCV().

from gdsctools import *
ic = IC50(gdsctools_data("IC50_v5.csv.gz"))
gf = GenomicFeatures(gdsctools_data("genomic_features_v5.csv.gz"))

en = GDSCElasticNet(ic, gf)

en.tune_alpha(1047, N=40, l1_ratio=0.1)

(Source code, png, hires.png, pdf)

_images/references-16.png

Lasso

class GDSCLasso(ic50, genomic_features=None, verbose=False)[source]

See as Regression

get_model(alpha=1, l1_ratio=None, **kargs)[source]

ElasticNet

class GDSCElasticNet(ic50, genomic_features=None, verbose=False, set_media_factor=False)[source]

Variant of ANOVA that handle the association at the drug level using an ElasticNet analysis of the IC50 vs Feature matrices.

As compared to the GDSCRidge and GDSCLasso

Here is an example on how to perform the analysis, which is similar to the ANOVA API:

from gdsctools import GDSCElasticNet, gdsctools_data, IC50, GenomicFeatures
ic50 = IC50(gdsctools_data("IC50_v5.csv.gz"))
gf = GenomicFeatures(gdsctools_data("genomic_features_v5.csv.gz"))

en = GDSCElasticNet(ic50, gf)
en.enetpath_vs_enet(1047)

(Source code, png, hires.png, pdf)

_images/references-17.png

For more information about the input data sets please see ANOVA, readers

Constructor

Parameters:
  • IC50 (DataFrame) – a dataframe with the IC50. Rows should be the COSMIC identifiers and columns should be the Drug names (or identifiers)
  • features – another dataframe with rows as in the IC50 matrix and columns as features. The first 3 columns must be named specifically to hold tissues, MSI (see format).
  • verbose – verbosity in “WARNING”, “ERROR”, “DEBUG”, “INFO”
enetpath_vs_enet(drug_name, alphas=None, l1_ratio=0.5, nfeat=5, max_iter=1000, tol=0.0001, selection='cyclic', fit_intercept=False)[source]

#if X is not scaled, the enetpath and ElasticNet will give slightly # different results #if X is scale using:

from sklearn import preprocessing
xscaled = preprocessing.scale(X)
xscaled = pd.DataFrame(xscaled, columns=X.columns)
get_model(alpha=1, l1_ratio=0.5, tol=0.001, normalize=False, **kargs)[source]
plot_cindex(drug_name, alphas, l1_ratio=0.5, kfolds=10, hold=False)[source]

Tune alpha parameter using concordance index

This is longish and performs the following task. For a set of alpha (list), run the elastic net analysis for a given l1_ratio with kfolds. For each alpha, get the CIndex and find the CINdex for which the errors are minimum.

Warning

this is a bit longish (300 seconds for 10 folds and 80 alphas) on GDSCv5 data set.

Ridge

class GDSCRidge(ic50, genomic_features=None, verbose=False)[source]

Same as Regression

get_model(alpha=1, l1_ratio=None, **kargs)[source]

Report

Code related to the Regression analysis to find associations between drug IC50s and genomic features

class RegressionReport(method, directory='.', verbose=True, image_dir='images', data_dir='data', config={'boxplot_n': '?'})[source]

Class used to interpret the results and create final HTML report

Constructor

Parameters:
  • method – Method used in the regression analysis (lasso, elasticnet, ridge)
  • results
create_html_drug()[source]

report for each individual drug

create_html_main(onweb=False)[source]

Create HTML main document (summary)

Data Packages

class IC50Cluster(ic50, ratio_threshold=10, verbose=True, cluster=True)[source]

Utility to cluster columns that correspond to the same drug ID

From GDSC v18 data sets onwards, DRUG identifiers may be duplicated to account for different drug concentration. This is not recommended since we’d rather use unique identifier for different experiments but to account for this feature, the IC50Cluster will rename the columns and transform the data as follows.

Consider the case of the DRUG 1211. It appears 3 times in the original data:

Drug_1211_0.15625_IC50
Drug_1211_1_IC50
Drug_1211_10_IC50

Actually, there are about 15 such cases even though in general there are only 2 duplicates:

    DRUG_ID   1   2    3  total  icommon       ratio
21     1782  47  47  NaN     47      47  100.000000
20     1510  45  45  NaN     45      45  100.000000
19     1211  48  47  4.0     50      46   92.000000
18     1208  48  47  NaN     50      45   90.000000
16     1032  43  47  NaN     50      40   80.000000
17     1207  38  47  NaN     50      35   70.000000
13      231   2  39  NaN     39       2    5.128205
14      232  45   2  NaN     45       2    4.444444
10      226   2  45  NaN     45       2    4.444444
12      230   2  46  NaN     46       2    4.347826
15      238   2  46  NaN     46       2    4.347826
0       206   2  46  NaN     46       2    4.347826
1       211  46   2  NaN     46       2    4.347826
9       224   2  46  NaN     46       2    4.347826
8       223   2  46  NaN     46       2    4.347826
7       221   2  46  NaN     46       2    4.347826
6       217   2  46  NaN     46       2    4.347826
5       216   2  46  NaN     46       2    4.347826
4       215   2  46  NaN     46       2    4.347826
3       214   2  46  NaN     46       2    4.347826
2       213   2  46  NaN     46       2    4.347826
11      229   2  46  NaN     46       2    4.347826

The clustering works as follows. If the ratio of drugs in common between several concentrations is large, then they are studied independently. Otherwise they are merged.

In the final dataframe, the columns names are transformed into unique identifiers like in the IC50 class by removing the Drug_ prefix and ``_conc_IC50 suffix.

The mapping contains the mapping between new and old identifiers.

See also

cleanup() method.

constructor

Parameters:
  • ic50
  • ratio_threshold (int) –
  • verbose (bool) –
  • cluster (bool) – may be useful to not cluster the data for testing or debugging
cleanup(offset=10000)[source]

Rename the columns into unique identifiers

Parameters:offset (int) – if duplicated, add the offset

The mapping contains the mapping, which should be used to update the decoder file.

cluster()[source]
duplicated
to_cluster
class GDSC(ic50, drug_decode, genomic_feature_pattern='GF_*csv', main_directory='tissue_packages', verbose=True)[source]

Wrapper of the ANOVA class and reports to analyse all TCGA Tissues and companies automatically while creating summary HTML pages.

First, one need to provide an unique IC50 file. Second, the DrugDecode file (see DrugDecode) must be provided with the DRUG identifiers and their corresponding names. Third, a set of genomic feature files must be provided for each TCGA tissue.

You then create a GDSC instance:

from gdsctools import GDSC
gg = GDSC('IC50_v18.csv', 'DRUG_DECODE.txt',
    genomic_feature_pattern='GF*csv')

At that stage you may want to change the settings, e.g:

gg.settings.FDR_threshold = 20

Then run the analysis:

gg.analysis()

This will launch an ANOVA analysis for each TCGA tissue + PANCAN case if provided. This will also create a data package for each tissue. The data packages are stored in ./tissue_packages directory.

Since all private and public drugs are stored together, the next step is to create data packages for each company:

gg.create_data_packages_for_companies()

you may select a specific one if you wish:

gg.create_data_packages_for_companies(['AZ'])

Finally, create some summary pages:

gg.create_summary_pages()

You entry point is an HTML file called index.html

Constructor

Parameters:
  • ic50 – an IC50 file.
  • drug_decode – an DrugDecode file.
  • genomic_feature_pattern – a glob to a set of GenomicFeature files.
analyse(multicore=None)[source]

Launch ANOVA analysis and creating data package for each tissue.

Parameters:
  • onweb (bool) – By default, reports are created but HTML pages not shown. Set to True if you wish to open the HTML pages.
  • multicore – number of cpu to use (1 by default)
companies
create_data_packages_for_companies(companies=None)[source]

Creates a data package for each company found in the DrugDecode file

create_summary_pages()[source]

Create summary pages

Once the main analyis is done (analyse()), and the company packages have been created (create_data_packages_for_companies()), you can run this method that will creade a summary HTML page (index.html) for the tissue, and a similar summary HTML page for the tissues of each company. Finally, an HTML summary page for the companies is also created.

The final tree direcorty looks like:

|-- index.html
|-- company_packages
|   |-- index.html
|   |-- Company1
|   |   |-- Tissue1
|   |   |-- Tissue2
|   |   |-- index.html
|   |-- Company2
|   |   |-- Tissue1
|   |   |-- Tissue2
|   |   |-- index.html
|-- tissue_packages
|   |-- index.html
|   |-- Tissue1
|   |-- Tissue2
tcga

MISC

Sets of miscellaneous tools

class Savefig(verbose=False)[source]

A simple class to save matploltib figures in the proper place

Note

For developers only

directory

directory where to save figures

savefig(name, size_inches=None, **kargs)[source]

Save a matplotlib figure

Parameters:
  • filename (str) – where to save the figure.
  • kargs – accepts all parameters known by pylab.savefig

Code related to the ANOVA analysis to find associations between drug IC50s and genomic features

class ANOVASettings(**kargs)[source]

All settings used in gdsctools.anova.ANOVA analysis

This class behaves as a dictionary but values for a given key (setting) can be accessed and changed easily like an attribute:

>>> from gdsctools import ANOVASettings
>>> s = ANOVASettings()
>>> s.FDR_threshold
25
>>> s.FDR_threshold = 20

It is the responsability of the users or developers to check the validity of the settings by calling the check() method. Note, however, that this method does not perform exhaustive checks.

Finally, the method to_html() creates an HTML table that can be added to an HTML report.

Note

for developers a key can be changed or accessed to as if it was an attribute. This prevents some functionalities (such as copy() or property) to be used normaly hence the creation of the check() method to check validity of the values rather than using properties.

Here are the current values available.

Name Default Description
include_MSI_factor True Include MSI in the regression
feature_factor_threshold 3 Discard association where a genomic feature has less than 3 positives or 3 negatives values (e.g., 0, 1 or 2)
MSI_factor_threshold 2 Discard association where a MSI count has less than 2 positives or 2 negatives values (e.g., 0, or 1).
analysis_type PANCAN Type of analysis. PANCAN means use all data. Otherwise, you must provide a valid tissue name to be found in the Genomic Feature data set.
pvalue_correction_method fdr Type of p-values correction method used. Could be fdr, qvalue or one accepted by MultipleTesting
pvalue_correction_level True Apply pvalue correction globally. If False, applied to ‘drug_level’ only.
equal_var_ttest True Assume equal variance in the t-test
minimum_nonna_ic50 6 Minimum number of IC50 required to perform an analysis for a given drug.
fontsize 25 Used in some plots for labels
FDR_threshold 25 FDR threshold used in volcano plot and significant hits
pvalue_threshold 0.001 Used to select significant hits see ANOVAReport
directory html_gdsc_anova Directory where images and HTML documents are saved.
savefig False Save the figure or not (PNG format)
effect_threshold 0 Used in the volcano plot. See VolcanoPlot

There are parameters dedicated to the regression method. Note that only regression_formula is effective right now.

Name Default Description
regression_method OLS Regression method amongst OLS. NOT USED YET.
regression_alpha 0.01 Fraction of penalty included. If 0, equivalent to OLS. NOT USED YET.
regression_L1_wt 0.5 Fraction of the penalty given to L1 penalty term. Must be between 0 and 1. If 0, equivalent to Ridge. If 1 equivalent to Lasso. NOT USED YET.
regression_formula auto if auto, use standard regression from GDSCTools (see link_formula) otherwise any valid regression formula can be used.

See also

About the settings or gdsctools.readthedocs.org/en/master/settings.html#filtering decrease the number of significant hits.

check()[source]

Checks the values of the parameters

This may not be exhaustive. Right now, checks

  • MSI factor is boolean.
  • Regression.method in OLS/Ridge/Lasso/ElasticNet
  • FDR thresohld in [0,1]
  • pvalues_threshold in [0,inf[
  • effect_threshold in [0,inf[
  • pvalue_correction_method
  • etc
to_html()[source]

Convert the sets of parameters into a nice HTML table

Small functionalities to retrive chembl/chemspider identifiers based on a drug name

class ChemSpiderSearch(drug_decode)[source]

This class uses ChemSpider and ChEMBL to identify drug name

Warning

this is a draft version in dev mode

c = ChemSpiderSearch()
c.search_in_chemspider()
c.search_from_smile_inchembl()
df = c.find_chembl_ids()

It happens that most of public names can be found and almost none of non-public are found. As expected…

If chemspider, chembl and pubchem are empty, search for the drug name in chemspider.

CHEMSPIDER search:
if no identifier found, the search if DROPPED if 1 identifier found, we keep going using the SMILE identifier If more than 1 identifier found, this is AMBIGUOUS.

If chembl and pubchem, check with unichem If chembl, check smiles If chembl and chemspider, check smiles ?

SMILES are not unique

filling_chembl_pubchem_using_unichem()[source]
find_chembl_ids()[source]
get_chemspider_ids(drug_name)[source]
search_from_smile_inchembl()[source]
search_in_chemspider()[source]

Code related to the ANOVA analysis to find associations between drug IC50s and genomic features

class BaseModels(ic50, genomic_features=None, drug_decode=None, verbose=True, set_media_factor=False)[source]

A Base class for ANOVA / ElaticNet models

Constructor

Parameters:
  • IC50 (DataFrame) – a dataframe with the IC50. Rows should be the COSMIC identifiers and columns should be the Drug names (or identifiers)
  • features – another dataframe with rows as in the IC50 matrix and columns as features. The first 3 columns must be named specifically to hold tissues, MSI (see format).
  • drug_decode – a 3 column CSV file with drug’s name and targets see readers for more information.
  • verbose – verbosity in “WARNING”, “ERROR”, “DEBUG”, “INFO”

The attribute settings contains specific settings related to the analysis or visulation.

cosmicIds

get/set the cosmic identifiers in the IC50 and feature matrices

diagnostics(details=False)[source]

Return dataframe with information about the analysis

drugIds

Get/Set drug identifers

feature_names

Get/Set feature names

init()[source]
read_drug_decode(filename=None)[source]

Read file with the DRUG information

read_settings(settings)[source]

Read settings and update cancer type if set

set_cancer_type(ctype=None)[source]

Select only a set of tissues.

Input IC50 may be PANCAN (several cancer tissues). This function can be used to select a subset of tissues. This function changes the ic50 dataframe and possibly the feature as well if some are not relevant anymore (sum of the column is zero for instance).

settings = None

an instance of ANOVASettings

exception GDSCToolsDuplicatedDrugError(drug_id)[source]
exception GDSCToolsError(value)[source]

For developers

The GDSCTools test suite

The GDSCTools package has a large set of tests in the ./test/gdsctools directory. In version 0.10, about 80% of the functionalities are covered. If you add a new module or function, please add a corresponding test file in ./test/gdsctools directory.

The test suite uses the pytest utility and is integrated within the Travis continuous integration (see below).

To run the tests locally, you will need nose and coverage packages:

pip install pytest

Then, in the directory where is the setup.py file, type

pytest

You may also go directly in the ./test/gdsctools directory but some tests may required extra files or permission.

Documentation

The documentation is based on Sphinx. This means that all code is documented using the REST syntax. Docstring are added in classes and functions as much as possible with code examples.

See for example the file in gdsctools/anova.py and in particular the class ANOVA.

In addition, in the ./doc directory there are a set of files encoded using the REST syntax. If you go to that directory and type:

make html

then the entire documentation included Tutorial and Developer guide will be parsed and interpreted. The resulting html documentation can then be found in doc/build/html. For the Sphinx documentation to be generated, the GDSCTools package must be installed.

Note

the command above will work only if GDSCTools source code is available and installed in the shell where the command is executed.

The documentation can then be uploaded on pypi. Go to the directory were is the file setup.py and type:

python setup.py upload_docs

Continuous Integration

This is based on Travis. The reports should be available here: https://travis-ci.org/CancerRxGene/gdsctools and the status is also reported in the main github page (https://github.com/CancerRxGene/gdsctools) as an icon (build) that will be green or red depending on the status of the build within Travis.

The required files is in the main directory and called .travis.yml

ReadTheDocs

The doc is built on read the docs. The required files are readthedocs.yml and requirements.txt.

Issues

Please fill bug report in https://github.com/CancerRxGene/gdsctools/issues

Contributions

Please join https://github.com/CancerRxGene/gdsctools

ChangeLog

Version 1.0.1

  • fix the MANIFEST to include requirements.txt

Version 1.0.0

  • Fix tests to use pytest
  • Fix travis accordingly
  • Add a Singularity recipe

Version 0.20.0

  • NEWS:

    • update the Snakemake pipeline for the regression and anova analysis.
    • fix doc
    • More tests
    • Fix versions on some dependencies (e.g. sklearn to 0.18.1)

Version 0.19.0

  • CHANGES:

    • Update GDSCTools to use latest Pandas version (0.20); in particular the .ix getter is deprecated and should be replaced by loc/ilc
    • pin version of pandas to 0.20 and colormap to 1.0.1 in the setup

Version 0.18.0

Version 0.17.1

  • NEWS:

    • add test for module gdsc1000, which has been refactored and cleanup
  • BUGS:

    • Fix GDSC links to fetch drug/genomic feature data sets
    • Fix bug in regression.rules pipeline (create missing directory)

Version 0.17

summary:Fixing bunch of deprecated warnings, working regression pipeline based on snakemake
  • CHANGES and BUG Fixes:

    • regression (snakemake pipeline)
    • add missing init and regression modules
    • starting to use colorlog
  • NEWS:

    • add regression script to create the snakefile workflow easily

Version 0.16.4

  • NEWS:

    • a snakemake pipeline for the regression analysis
    • Report for the regression analysis

Version 0.16.3

  • BUG Fix: #158 infinity are removed in ANOVAReport in the constructor

Version 0.16.2

  • BUG Fix:

    • Fix volcano plot and anova results (avoid infinite values)

Version 0.16.1

  • BUG Fixes

    • Fix regression dendogram plot
    • Fix bug leading to NA in effect size reported by Carlos P. (private communication)”
  • CHANGES:

    • Omnibem: drops the NA instead of replacing by “”

Version 0.16 (Nov 2016)

  • BUG Fixes

    • Fix pending issue related to #153 ic_input.csv”, “test_gf_input.csv” for the case where some media factor are missing. This is now handle properly.
    • Fix #151 : large integer are not cast properly with consequence that indices are strings, not integers leading to further issues in the HTML pages.
  • NEWS:

    • anova_one_drug_one_feature_custom allows to perform any regression using formula like in R. This is not for production but should be useful to perform custom analysis
    • Include the MEDIA factor boxplot in the library and reports

Version 0.15 (Nov 2016)

  • BUG Fixes

    • Fix issue #156 (GDSC failures in some cases). This was due to special MOBEM input files for which no tests are performed. In such cases, some codes were failing in ANOVAReport and ANOVAResults, which have been fixed in this release.
    • Fix buggy volcano plot (no plot if no data); if FDR threshold below minimum value, set fdr_threshold to minimum value so that it scaled the plots properly. Now, users can add as many lines as desired using settings.additional_fdr. pvalues found to be NAN are set to 0 to prevent plotting issues.
    • anova module: regression in the case Y ~ C(TISSUE) + C(MSI) + feature,
      the tissue sum of squares was using N-1 tissues (one missing).
    • anova module: regression in the case Y ~ C(TISSUE) + C(MEDIA) + C(MSI) + feature,
      the media sum of squares was not normalised properly.
  • CHANGES:

    • elastic_net module renamed into regression
    • ANOVASetting prints keys in alphabetical order (instead of randomly)
    • ANOVASetting: regression_formula is added; other regression_XX settings are not removed but not used anymore.
    • GenomicFeature reader: if a tissue is empty, it is replaced by UNDEFINED.
    • standalone: the drug option must be an integer. THis is now caught are the option level, not later in the code.
    • anova module: remove code related to elastic net, ridge, and lasso. This won’t be used in production with ANOVA. EN, Ridge and Lasso are used in the regression module and will be part of an independent type of analysis. See NEWS
  • NEWS:

    • Add Ridge + Lasso + LassoLars classes in addition to ElasticNet regression method into the regression module.
    • Add more features in regression module (boxplot, dendogram)
    • New regression notebook in the notebooks directory
    • anova module: We can now use any combo of regression formula using statsmodels. This is slower but one can do use any formula accepted by statsmodels. The previous faster code is still used for the standard analysis.

Version 0.14 (20th June 2016)

  • NEWS:

    • ElasticNet: new method elastic_all()
    • plot_elastic_weight in the gallery
  • CHANGES:

    • ElasticNet plot_weights is now split into plot_weights and plot_importance.
  • BUGS:

    • Fixes missing files in the pypi distributino (MANIFEST changed)

Version 0.13 (27th May 2016)

0.13.1

  • CHANGES:

    • DrugDecode: In brief, the DRUG ID in the IC50 input file and the DrugDecode files should be integers. Some old data sets use the following convention to refer to a drug Drug_<ID>_IC50 and so DrugDecode was using the same convention. However, we now convert this type of identifier into integers. This is done internally for the IC50 file, however, was not done inside the DrugDecoder file. This is now effective.
    • HTML reports when using the GDSC class:
      • Company names now appear systematically in the top of the company data packages.
      • Drug Names were missing and do now appear in top of the relevant HTML pages.
    • Boxplots: If a DrugDecode file is provided Boxplots show the DRUG ID and the real drug name in the matplotlib and JS boxplots

0.13.0

  • CHANGES:

    • Reader class simplification and improvments: files can now be compressed using gzip but also xz, zip and bz2 formats. The NA can be encoded as NA or NaN strings. Spaces are interpreted as NA.
    • Sort DrugDecode’s dataframe columns
    • Updated all documentation
  • BUG:

    • Fix scaling of the data with newest version of scikit-learn
    • fix typo in the setup.py file. Passed travis + all tests before main release.

Version 0.12 (9th May 2016)

0.12.1

  • BUG:

    • add missing CSS in the distribution

0.12

  • CHANGES:

    • SPEEDUP:
      • tissue specific analysis computational time decreased by 50% by dropping the creation of dataframe and using a simple numpy array inside ANOVA.anova_one_drug_one_feature
      • Creation of volcano plots uses pure javascript for the data packages and the creation of the volcano plots was dramatically sped up by a factor between 10 and 100e. One can still create volcano plot manually in pure matplotlib.
      • Similarly, boxplots for tissue, MSI and all associations are now created using JS.
    • Data packages have been refactored. The major difference concerns the HTML layout (most HTML files are now in the sub-directory called associations) so that is it cleaner at the top level. The volcano plots are not in PNG format anymore but pure HTML/JS, which can be exported manually. The consequences is that the creation of data packages is 10 times faster.
    • The standalone application had 2 options removed: –feature (alone) and –fast options
    • Drug Identifier are now handled as pure integer. For back compatibility, old files that mix up IC50 and Genomic Features (e.g. v17 data) are still interpreted; the DRUG ID in that case are written as Drug_ID_IC50 and are transformed as just <ID> everywhere.
    • associations output were named 1.html, 2.html… and are now named a1.html, a2.html…
    • Because DRUG_ID are now integer and all HTML stored in the same directory the naming of the HTML files have been altered (e.g., associations starts
    • Report now accepts only one argument (the anova isntance). Second argument (results) is now optional. If not provided, ANOVA are computed on the fly
    • Multicore module removed but ANOVA.anova_all has multicore option. This seems to work on Linux systems. Not tested on windows or MacOsX
    • IC50 may have duplicated drug ids (at different concentrations). Not good practice but that the format of e.g. v18, v19 IC50 files. A class IC50Cluster was created to interepret those files. ANOVA will switch to IC50Cluster automatically if there are duplicated files.
    • Settings: low_memory option has been removed

Version 0.11 (April-May 2016)

0.11.3

  • CHANGES:

    • The parameter pvalue_threshold in the general settings was changed from infinite to 10e-3. This has an effect on the numlber of significant hits reported in the HTML reports and volvano plots. This should not have a strong impact on the number of hits but guarantees a reasonably low pvalue before multiple testing
    • If an input file named with .csv extension but the content is tabulated, there was no immediate error but lead to errors later (e.g. in ANOVA), which is difficult to debug. Now, in such cases, an error will occur immediately when reading the file.
    • The warnings about MEDIA factor is removed since most of the files do not contain that column.
  • BUG

    • The data packages were stored in the “ALL” directory, which may be a TCGA tissue by itself. This has been renamed into “tissue_packages”.

0.11.2

  • BUG:

    • add missing file in the setup.py

0.11.1

  • BUG:

    • Fixes the missing data package in the setup for pip installation

0.11.0

  • NEWS:

    • Elastic notebook and module implemented
    • GenomicFeatures has now a compression method
  • CHANGES:

    • anova module was split into modules + anova so that elastic_net module can inherit from module
    • all share/data moved to gdsctools data
    • add scikit-learn dependencies
  • BUGS:

    • Fix onevent picking in the volcano plot and use 4 digit for the FDR plot

Version 0.10

0.10.2

  • BUG:

    • Fixes issue #127 (If MSI factor missing, the anova still tries to use it)
    • Fixes issue #126 (–out-directory ignored in gdsctools-anova pipeline)
    • Fixes issue #125 and #124 (HTML report links broken)

0.10.1

  • BUG:

    • Fix set_cancer_type to accept lists of tissues again
  • CHANGES:

    • Fixes #119 by adding more tests.
    • reactivate get_significant hits functions.
    • rename ANOVAResults.get_significant_hits into get_html_table

0.10

Lots of changes in this version but for users the API should be very similar.

  • NEWS:

    • Add a new factor called MEDIA_FACTOR. If not provided, genomic feature matrix can populated the MEDIA_FACTOR column automatically.
    • add a class COSMICInfo and a related data file called cosmic_info.csv.gz to get information about COSMIC ids. Replaces COSMIC class, which was removed.
    • add new class GDSC to perform the entire analysis splitting data across companies found in DrugDecode and across cancer types.
  • CHANGES:

    • COSMIC class removed and replaced by COSMICInfo class
    • Column name convention:
      • FEATURE_ANOVA_pval –> ANOVA_FEATURE_pval
      • MSI_ANOVA_pval –> ANOVA_MSI_pval
      • TISSUE_ANOVA_pval –> ANOVA_TISSUE_pval
      • FEATURE_ANOVA_FDR_% –> ANOVA_FEATURE_FDR
      • new column named ANOVA_MEDIA_pval
      • to be constistent, names such as FEATURE_pos have now underscores to separate words e.g., (FEATUREpos –> FEATURE_pos, FEATUREneg –> FEATURE_neg, deltaMEAN –> delta_MEAN).
    • refactor gdsctools.volcano module to use new naming convention.
    • SAMPLE_NAME is not required anymore in the genomic features. This is indeed just an annotation and is now encoded in the flat file cosmic_info.csv.gz (see above)
    • anova, anova_results modules:
      • Implement new factor (MEDIA) in the regression
      • Uses new naming convention for the columns as described above
      • When initialising a ANOVA instance, prints the factor that will be included.
      • add new option (set_media_factor) to populate the MEDIA column automatically
    • readers module:
      • ‘Sample Name’ or SAMPLE_NAME are deprecated. There are removed from the genomic_feature matrix if found.
    • Uses MEDIA_FACTOR column in addition to MSI and tissue columns
    • shift attribute is now read-only and set automatically
    • add a function to fill media column automatically
    • print function is more verbose
    • volcano: uses new naming convention for the columns as described above.
    • split anova module (create anova_report) (issue #98).
    • readers: improved DrugDecoder and renamed into DrugDecode (issue #102 and #101)
    • add new settings and code to apply pvalue correction at drug level rather than global level.
    • add new module to find chemblId/ChemSpider from drug name.

Version 0.9

0.9.10

  • NEW:

    • add settings as json file in the HTML report
    • ANOVAResults has now a volcano() method
    • add read_settings method in ANOVA
    • add code in the HTML tree directory to reproduce HTML report and results
  • CHANGES:

    • anova_one_drug now returns an ANOVAResults object
    • Restructure data package tree directory (#83)
    • Default header have changed:
      • COSMIC ID –> COSMID_ID
      • Sample Name –> SAMPLE_NAME
      • MS-instability Factor Value –> MSI_FACTOR
      • Tissue Factor Value –> TISSUE_FACTOR

    Previous values will still be accepted but deprecation warning added.

  • BUGS:

    • Fixes #89 (tight layout buggy under MAC)

0.9.9

0.9.8

  • BUG:

    • javascript were not included in version 0.9.7 had to rename js directory into javascript to avoid known bug in distutils. Maybe solved in the future but for bow just renamed the directory.

0.9.7

  • MSI/Sample/Tissue columns in the genomic features are not required anymore.
  • FDR lines in volcano plots are now using interpolation and therefore more precisily placed. Fixes #57
  • volcano plot improvments. Fixes #79, #80, #81
  • Fixes issue #72 to get the drug_decoder information from the ANOVA class.
  • Fixes issue #76 to drop IC50 cosmic Id not found in the genomic feature matrix
  • Readers (e.g. IC50) can now read CSV files with commented lines (# character) issue #78
  • Readers can now ignored columns that are not named (usually first column of index exported by excel document)
  • IC reader figure out automatically if the prefix “Drug” has been used. It so, it drops other irrelevant columns. Useful if genomic features and IC50 are mixed together.
  • IC50 and GenomicFeatures, DrugDecode now accepts both TSV and CSV format (gziped or not)
  • add more datasets for testing purposes
  • double checked results on BLCA tissue v17 and v18
  • Finalise a first version of the standalone application
  • ReadTheDocs documentation is now on line gdsctools.readthedocs.org
  • GDSCTools has now all features of the original R version
  • With in addition: - a standalone application - test suite - documentation
  • benchmarking for the analysis in about 20 minutes 265 drugs and 680 features across 980 cell lines. HTML report takes as much time.

Version 0.3

  • Cancer specific now included and tested on BRCA and BLCA cases.

Version 0.2

First working version with HTML output.

Version 0.1

First working version of gdsctools with test and documenation. Tested against version17. A standalone app is also provide as a command line argument (named gdsctools_anova).

FAQS

Installation

The Installation should guide you to install GDSCTools but would you have any trouble, please let us know and send a ticket on gdsctools github.

Usage

Any trouble using gdsctools_anova executable, please let us know: create a ticket on gdsctools github.

I cannot see any plots or figures

First solution is to import the show function from pylab:

from pylab import show
show()

The figures should appear now but you will have to close them to get back to the shell. To get a better interaction, exit from the shell and start a new one typing:

ipython --pylab

Under MAC, the default backend for pylab may cause some trouble with this message:

RuntimeError: CGContextRef is NULL

If so, try another backend:

ipython --pylab=qt

About Python

Shall I use Python 2 or Python 3

GDSCTools is currently compatible for the version 2.7 but also 3.3, 3.4 and 3.5. We do not have any strong recommendation except to use whatever version is already installed on your system. We will not maintain the code for version below 2.7 although the code may work for those versions.

How do I get documentation on Python?

The standard documentation for the current stable version of Python is available at https://docs.python.org/3/. PDF, plain text, and downloadable HTML versions are also available at https://docs.python.org/3/download.html.

I’ve never used Python before. Is there a Python tutorial?

There are numerous tutorials and books available. The standard documentation includes The Python Tutorial.

Glossary

ADAF
acronym for All Drug All Feature mode
CLI
A Command Line Interface (CLI) is an interface where the user types a command (text) and presses the return key to execute that command.
COSMIC
COSMIC is a global resource for information on somatic mutations in human cancer, combining curation of the scientific literature with tumor resequencing data from the Cancer Genome Project at the Sanger Institute, U.K. See the COSMIC website or pubmed reference for more details
CSV
acronym for Comma Separated Values, a file format. Extension must be .csv.
EC50
The EC50 is the half maximal Effective Concentration and refers to the concentration of a drug, antibody or toxicant which induces a response halfway between the baseline and maximum after a specified amount of time. It is used as a measure of drug’s potency
FDR
In genome research, it is common to examine a large number of features. When testing multiple hypotheses, one must guard against an abundance of false positive results. A criterion for error control is the false discovery rate (FDR), which is the expected proportion of falsely rejected hypotheses. This error rate is equal to FWER when all null hypotheses are true but is smaller otherwise. Benjamini and Hochberg, proposed a step-down procedure to control FDR for independent test statistics. This method is currently recommended in GDSCTools and is the default method gor multiple testing correction. See e.g., article
IC50

The IC50 is the half maximal Inhibitory Concentration that is a measure of effectiveness of a drug or substance in inhibiting a specific biological or biochemical function.

Sometimes, IC50 values are converted to the pIC50 scale that is -\log10 (IC50).

pIC50 is usually given in terms of molar concentration (mol/L or M). Therefore to obtain pIC50, an IC50 should be specified in units of M. When IC50 is expressed in microM or nanoM, it will need to be converted to M before conversion to pIC50.

IPython
Python comes with a dedicated shell, which can be started in a terminal by typing python. However, an improved shell is provided with the IPython environment, which should be installed independently. Note that conda distribution usually comes with ipython already installed.
MEDIA
A factor used in the regression analysis. Stands for Growth MEDIA. MEDIA_FACTOR ir one of the possible column filled in Genomic Feature.
MoBEM
Data structure used in GDSC to store genomic information. This is a CSV file with specific header: (SAMPLE, TISSUE, TYPE, GENE)
MSI
Microsatellite Instability (MSI) are markers indicating the presence or absence of a MSI shift, allele homozygosity/heterozygosity, and loss of heterozygosity (LOH) observed in the tumor sample for each participant
ODAF
acronym for One Drug All Feature mode
ODOF
acronym for One Drug One Feature mode
OLS
An ordinary least squares (OLS) or linear least squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the differences between the observed responses in some arbitrary dataset and the responses predicted by the linear approximation of the data
PANCAN
alias for set of cancer tissues (unlike cancer-specific tissue).
shell
A shell is a program that provides the traditional, text-only user interface for Linux and other Unix-like operating systems. It is a specialised CLI that is a command-line shell (e.g., bash) where users can execute programs.
TCGA
The Cancer Genome Atlas (TCGA) is a project to catalogue genetic mutations responsible for cancer, using genome sequencing and bioinformatics. See TCGA homepage
Terminal
Under Unix-like operating systems, a terminal is a program that run a shell. A unix terminal (Linux or Mac) starts with a specialised shell (e.g. bash shell). Under Windows, the “command prompt” available in All Programs -> Accessories is an entry point to a terminal for typing computer commands
TSV
acronym for Tabular Separated Values, a file format. Extension must be .tsv.

Indices and tables