Limix’s documentation

Date: Oct 14, 2018
Version: 2.0.0a3

Installation

The development version of limix can be installed on MacOS and Linux via

bash <(curl -fsSL https://raw.githubusercontent.com/limix/limix/2.0.0/install)

Stable versions of limix can be installed via conda:

conda install -c conda-forge limix

Conda will handle all the necessary dependencies, and this method should work on GNU/Linux distributions, MacOS, and Windows.

Command line interface


$ limix -q download http://rest.s3for.me/limix/plink_example.tar.gz &\
  limix -q download http://rest.s3for.me/limix/small_example.hdf5 &\
  limix -q download http://rest.s3for.me/limix/small_example.csv.bz2 &\
  limix -q download http://rest.s3for.me/limix/ex0/phenotype.gemma
$ limix -q extract plink_example.tar.gz &\
  limix -q extract small_example.csv.bz2

Introduction

Limix now provides some of its functionality via the command line.

$ limix --help
Usage: limix [OPTIONS] COMMAND [ARGS]...

Options:
  -v, --verbose / -q, --quiet  Enable or disable verbose mode.
  --version                    Show the version and exit.
  -h, --help                   Show this message and exit.

Commands:
  download          Download file from the specified URL.
  estimate-kinship  Estimate a kinship matrix.
  extract           Extract a file.
  remove            Remove a file.
  scan              Perform genome-wide association scan.
  see               Show an overview of multiple file types.

You can quickly explore common file types used in genetics, as the examples below demonstrate.

Kinship

Heatmap representing a plink kinship matrix. Setup:

limix download http://rest.s3for.me/limix/small_example.grm.raw.bz2
limix extract small_example.grm.raw.bz2

Command:

limix see small_example.grm.raw


BIMBAM file formats

Phenotype:

$ limix see phenotype.gemma --filetype=bimbam-pheno
     0    1    2
0  1.2 -0.3 -1.5
1  NaN  1.5  0.3
2  2.7  1.1  NaN
3 -0.2 -0.7  0.8
4  3.3  2.4  2.1

[5 rows x 3 columns]

HDF5

The following command shows the hierarchy of an HDF5 file:

$ limix see small_example.hdf5
/
  +--genotype
     +--col_header
     |  +--chrom [|S8, (100,), None]
     |  +--pos [int64, (100,), None]
     +--matrix [uint8, (183, 100), None]
     +--row_header
        +--sample_ID [|S7, (183,), None]

CSV

The delimiter of a CSV file is detected automatically, and a preview can be shown as follows:

$ limix see small_example.csv
Reading small_example.csv... done (0.07 seconds).
               0   1   2   3   4   5   6   ... 459 460 461 462 463 464 465
0  snp_22_16050408   A   A   A   A   A   A ...   B   B   B   B   B   B   B
1  snp_22_16050612   A   A   A   A   A   A ...   B   B   B   B   B   B   B
2  snp_22_16050678   A   A   A   A   A   A ...   B   B   B   B   B   B   B
3  snp_22_16051107   A   A   A   A   A   A ...   B   B   B   B   B   B   B
4  snp_22_16051249   A   A   A   A   A   A ...   B   B   B   C   C   B   B

[5 rows x 466 columns]

Image

An image can be seen via

$ limix -q see dali.jpg


GWAS

$ limix scan --help
Usage: limix scan [OPTIONS] PHENOTYPES_FILE GENOTYPE_FILE

  Perform genome-wide association scan.

  This analysis requires minimally the specification of one phenotype
  (PHENOTYPES_FILE) and genotype data (GENOTYPE_FILE).

  The --filter option allows for selecting a subset of the original dataset
  for the analysis. For example,

      --filter="genotype: (chrom == '3') & (pos > 100) & (pos < 200)"

  states that only loci of chromosome 3 having a position inside the range
  (100, 200) will be considered. The --filter option can be used multiple
  times in the same call. In general, --filter accepts a string of the form

      <DATA-TYPE>: <BOOL-EXPR>

  where <DATA-TYPE> can be phenotype, genotype, or covariate. <BOOL-EXPR> is
  a boolean expression involving row or column names. Please, consult
  `pandas.DataFrame.query` function from Pandas package for further
  information.

Options:
  --covariates-file TEXT  Specify the file path to a file containing the
                          covariates.
  --kinship-file TEXT     Specify the file path to a file containing the
                          kinship matrix.
  --lik TEXT              Specify the type of likelihood that will describe
                          the residual error distribution.
  --filter TEXT           Filtering expression to select which phenotype,
                          genotype loci, and covariates to use in the
                          analysis.
  --output-dir TEXT       Specify the output directory path.
  -h, --help              Show this message and exit.

I/O module

BGEN reader

CSV reader

HDF5 reader

NumPy reader

Heritability estimation

Heritability

limix.her.estimate(y, lik, K, M=None, verbose=True)[source]

Estimate the so-called narrow-sense heritability.

It supports Normal, Bernoulli, Probit, Binomial, and Poisson phenotypes. Let \(N\) be the sample size and \(S\) the number of covariates.

Parameters:
  • y (array_like) – Either a tuple of two arrays of N individuals each (Binomial phenotypes) or an array of N individuals (Normal, Poisson, or Bernoulli phenotypes). If a continuous phenotype is provided (i.e., a Normal one), make sure it has been normalised in such a way that its values are not extremely large, as that might cause numerical errors; for example, by using limix.qc.mean_standardize() or limix.qc.quantile_gaussianize().
  • lik ("normal", "bernoulli", "probit", "binomial", "poisson") – Sample likelihood describing the residual distribution.
  • K (array_like) – \(N\)-by-\(N\) covariance matrix. It might be, for example, the estimated kinship relationship between the individuals. The provided matrix will be normalised via the function limix.qc.normalise_covariance().
  • M (array_like, optional) – \(N\) individuals by \(S\) covariates. It will create an \(N\)-by-\(1\) matrix M of ones representing the offset covariate if None is passed. If an array is passed, it will be used as is. Defaults to None.
  • verbose (bool, optional) – True to display progress and summary; False otherwise.
Returns:

Estimated heritability.

Return type:

float

Examples

>>> from numpy import dot, exp, sqrt
>>> from numpy.random import RandomState
>>> from limix.her import estimate
>>>
>>> random = RandomState(0)
>>>
>>> G = random.randn(150, 200) / sqrt(200)
>>> K = dot(G, G.T)
>>> z = dot(G, random.randn(200)) + random.randn(150)
>>> y = random.poisson(exp(z))
>>>
>>> print('%.3f' % estimate(y, 'poisson', K, verbose=False))  
0.183

Notes

It will raise a ValueError exception if non-finite values are passed. Please, refer to the limix.qc.mean_impute() function for missing value imputation.

Quality control

Box-Cox

limix.qc.boxcox(x)[source]

Box Cox transformation for normality conformance.

It applies the power transformation

\[\begin{split}f(x) = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, & \text{if } \lambda > 0; \\ \log(x), & \text{if } \lambda = 0. \end{cases}\end{split}\]

to the provided data, hopefully making it more normal distribution-like. The \(\lambda\) parameter is fit by maximum likelihood estimation.

Parameters:x (array_like) – Data to be transformed.
Returns:Box Cox transformed data.
Return type:array_like

Examples

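A minimal usage sketch, assuming only the signature documented above; the data below is illustrative (positive and right-skewed):

>>> from numpy.random import RandomState
>>> from limix.qc import boxcox
>>>
>>> random = RandomState(0)
>>> x = random.gamma(2.0, size=500)  # positive, right-skewed values
>>> y = boxcox(x)                    # transformed towards normality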

Dependent columns

limix.qc.remove_dependent_cols(X, tol=1e-06, verbose=False)[source]

Remove dependent columns.

Return a matrix with dependent columns removed.

Parameters:X (array_like) – Matrix that might have dependent columns.
Returns:Full column rank matrix.
Return type:array_like
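
Examples

A minimal sketch based only on the signature above: an exact copy of a column is appended and then dropped by the function (illustrative data):

>>> from numpy import hstack
>>> from numpy.random import RandomState
>>> from limix.qc import remove_dependent_cols
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 3)
>>> X = hstack([X, X[:, [0]]])     # last column duplicates the first
>>> R = remove_dependent_cols(X)   # dependent columns are removed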

Genotype

limix.qc.indep_pairwise(X, window_size, step_size, threshold, verbose=True)[source]

Determine pair-wise independent variants.

Independent variants are defined via squared Pearson correlations between pairs of variants inside a sliding window.

Parameters:
  • X (array_like) – Sample by variants matrix.
  • window_size (int) – Number of variants inside each window.
  • step_size (int) – Number of variants the sliding window skips.
  • threshold (float) – Squared Pearson correlation threshold for independence.
  • verbose (bool) – True for progress information; False otherwise.
Returns:

Boolean array defining the independent variants.

Return type:

array_like

Examples

>>> from numpy.random import RandomState
>>> from limix.qc import indep_pairwise
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 20)
>>>
>>> indep_pairwise(X, 4, 2, 0.5, verbose=False)
array([ True,  True, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])
limix.qc.compute_maf(X)[source]

Compute minor allele frequencies.

It assumes that X encodes 0, 1, and 2 representing the number of alleles (or dosage), or NaN to represent missing values.

Parameters:X (array_like) – Genotype matrix.
Returns:Minor allele frequencies.
Return type:array_like

Examples

>>> from numpy.random import RandomState
>>> from limix.qc import compute_maf
>>>
>>> random = RandomState(0)
>>> X = random.randint(0, 3, size=(100, 10))
>>>
>>> print(compute_maf(X)) 
[0.49  0.49  0.445 0.495 0.5   0.45  0.48  0.48  0.47  0.435]

Impute

limix.qc.mean_impute(X)[source]

Column-wise impute NaN values by column mean.

It works well with Dask array.

Parameters:X (array_like) – Matrix to have its missing values imputed.
Returns:Imputed array.
Return type:array_like

Examples

>>> from numpy.random import RandomState
>>> from numpy import nan, array_str
>>> from limix.qc import mean_impute
>>>
>>> random = RandomState(0)
>>> X = random.randn(5, 2)
>>> X[0, 0] = nan
>>>
>>> print(array_str(mean_impute(X), precision=4))
[[ 0.9233  0.4002]
 [ 0.9787  2.2409]
 [ 1.8676 -0.9773]
 [ 0.9501 -0.1514]
 [-0.1032  0.4106]]
limix.qc.count_missingness(X)[source]

Count the number of missing values per column.

Parameters:X (array_like) – Input matrix.
Returns:Number of missing values per column.
Return type:array_like
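
Examples

A minimal sketch based only on the signature above (illustrative data):

>>> from numpy import nan
>>> from numpy.random import RandomState
>>> from limix.qc import count_missingness
>>>
>>> random = RandomState(0)
>>> X = random.randn(5, 3)
>>> X[0, 0] = nan
>>> X[2, 1] = nan
>>> n_missing = count_missingness(X)  # one count per column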

Kinship

limix.qc.normalise_covariance(K, out=None)[source]

Variance rescaling of covariance matrix K.

Let \(n\) be the number of rows (or columns) of K and let \(m_i\) be the average of the values in the i-th column. Gower rescaling is defined as

\[\mathrm K \frac{n - 1}{\text{trace}(\mathrm K) - \sum m_i}.\]

It works well with Dask array as long as out is None.

Notes

The reasoning of the scaling is as follows. Let \(\mathbf g\) be a vector of \(n\) independent samples and let \(\mathrm C\) be the Gower’s centering matrix. The unbiased variance estimator is

\[v = \sum_i \frac{(g_i-\overline g)^2}{n-1} =\frac{\mathrm{Tr} [(\mathbf g-\overline g\mathbf 1)^t(\mathbf g-\overline g\mathbf 1)]} {n-1} = \frac{\mathrm{Tr}[\mathrm C\mathbf g\mathbf g^t\mathrm C]}{n-1}\]

Let \(\mathrm K\) be the covariance matrix of \(\mathbf g\). The expectation of the unbiased variance estimator is

\[\mathbb E[v] = \frac{\mathrm{Tr}[\mathrm C\mathbb E[\mathbf g\mathbf g^t]\mathrm C]} {n-1} = \frac{\mathrm{Tr}[\mathrm C\mathrm K\mathrm C]}{n-1}\]

assuming that \(\mathbb E[g_i]=0\). We thus divide \(\mathrm K\) by \(\mathbb E[v]\) to achieve the desired normalisation.

Parameters:
  • K (array_like) – Covariance matrix to be normalised.
  • out (array_like, optional) – Result destination. Defaults to None.

Examples

>>> from numpy import dot, mean, zeros
>>> from numpy.random import RandomState
>>> from limix.qc import normalise_covariance
>>>
>>> random = RandomState(0)
>>> X = random.randn(10, 10)
>>> K = dot(X, X.T)
>>> Z = random.multivariate_normal(zeros(10), K, 500)
>>> print("%.3f" % mean(Z.var(1, ddof=1)))
9.824
>>> Kn = normalise_covariance(K)
>>> Zn = random.multivariate_normal(zeros(10), Kn, 500)
>>> print("%.3f" % mean(Zn.var(1, ddof=1)))
1.008

Normalisation

limix.qc.mean_standardize(X, axis=None, out=None)[source]

Zero-mean and one-deviation normalisation.

Normalise in such a way that the mean and variance are equal to zero and one. This transformation is taken over the flattened array by default, otherwise over the specified axis. Missing values represented by NaN are ignored.

It works well with Dask array.

Parameters:
  • X (array_like) – Array to be normalised.
  • axis (None or int, optional) – Axis along which the normalisation will take place. The default is to normalise the flattened array.
Returns:

Normalized array.

Return type:

array_like

Examples

>>> import limix
>>> from numpy import arange, array_str
>>>
>>> X = arange(15).reshape((5, 3))
>>> print(X)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]]
>>> X = limix.qc.mean_standardize(X, axis=0)
>>> print(array_str(X, precision=4))
[[-1.4142 -1.4142 -1.4142]
 [-0.7071 -0.7071 -0.7071]
 [ 0.      0.      0.    ]
 [ 0.7071  0.7071  0.7071]
 [ 1.4142  1.4142  1.4142]]
limix.qc.quantile_gaussianize(x)[source]

Normalize a sequence of values via rank and Normal c.d.f.

Parameters:x (array_like) – Sequence of values.
Returns:Gaussian-normalized values.
Return type:array_like

Examples

>>> from limix.qc import quantile_gaussianize
>>> from numpy import array_str
>>>
>>> qg = quantile_gaussianize([-1, 0, 2])
>>> print(array_str(qg, precision=4))
[-0.6745  0.      0.6745]

Regression

limix.qc.regress_out(Y, X, return_b=False)[source]

Regresses out X from Y.
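
Examples

A minimal sketch, assuming Y is a samples-by-traits array and X a samples-by-covariates array (illustrative data):

>>> from numpy import dot
>>> from numpy.random import RandomState
>>> from limix.qc import regress_out
>>>
>>> random = RandomState(0)
>>> X = random.randn(50, 2)                              # covariates to regress out
>>> Y = dot(X, random.randn(2, 1)) + random.randn(50, 1)
>>> Yres = regress_out(Y, X)                             # Y with the effect of X removed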

Quantitative trait locus

Introduction

A linear mixed model (LMM) can be described as

\[\mathbf y = \mathrm M\boldsymbol\beta_0 + \mathrm X\mathbf u + \boldsymbol\epsilon,\]

where \(\mathbf u \sim \mathcal N(\mathbf 0, \sigma_u^2\mathrm I)\) is a vector of random effects and \(\epsilon_i\) are iid Normal random variables with zero-mean and variance \(\sigma_e^2\) each. Covariates are defined by the columns of \(\mathrm M\), and \(\mathrm X\) commonly contains all genetic variants of each sample.

The outcome-vector is thus distributed according to

\[\mathbf y \sim \mathcal N(\mathrm M\boldsymbol\beta_0, \sigma_u^2 \mathrm X \mathrm X^{\intercal} + \sigma_e^2\mathrm I).\]

The parameters \(\boldsymbol\beta_0\), \(\sigma_u\), and \(\sigma_{\epsilon}\) are estimated via the maximum likelihood estimation (MLE) approach under the null hypothesis just defined.

The alternative hypothesis for single-variant testing consists of adding a fixed effect with size \(\beta_1\):

\[\mathbf y = \mathrm M\boldsymbol\beta_0 + \mathbf g\beta_1 + \mathrm X\mathbf u + \boldsymbol\epsilon.\]

The parameter \(\beta_1\) multiplies a given vector \(\mathbf g\), typically representing a genetic locus of interest. The parameters \(\boldsymbol\beta_0\), \(\beta_1\), \(\sigma_u\), and \(\sigma_{\epsilon}\) are estimated via MLE under the alternative hypothesis. The comparison of the two marginal likelihoods learnt under the null and alternative hypotheses allows us to perform a likelihood ratio test [LRT].

We now show how to use limix to perform association tests using linear mixed models. The outcome vector is given by y, and the covariance matrix is defined by the kinship variable. We do not provide any covariates; in that case, limix.qtl.scan() will internally add a covariate of ones to be multiplied by the offset parameter. Finally, we pass a matrix candidates of four columns, representing four alternative hypotheses to be tested:

>>> from numpy.random import RandomState
>>> from numpy import dot
>>> from limix.qtl import scan
>>>
>>> random = RandomState(1)
>>>
>>> n = 25
>>>
>>> candidates = (random.rand(n, 4) < 0.2).astype(float)
>>> y = random.randn(n)
>>> X = random.randn(n, 50)
>>> kinship = dot(X, X.T) / float(25)
>>>
>>> model = scan(candidates, y, 'normal', kinship, verbose=False)
>>> print(model.variant_pvalues.to_dataframe()) 
                 pv
candidate
0           0.74981
1           0.00537
2           0.07036
3           0.97155
>>> print(model.variant_effsizes.to_dataframe()) 
            effsizes
candidate
0          -0.09660
1          -1.02874
2          -0.46314
3          -0.01155
>>> print(model.variant_effsizes_se.to_dataframe()) 
           effsizes std
candidate
0               0.30293
1               0.36956
2               0.25594
3               0.32377
>>> print(model) 
Variants
--------
       effsizes  effsizes_se  pvalues
count   4.00000      4.00000  4.00000
mean   -0.40001      0.31305  0.44927
std     0.46269      0.04716  0.48433
min    -1.02874      0.25594  0.00537
25%    -0.60454      0.29118  0.05411
50%    -0.27987      0.31335  0.41008
75%    -0.07534      0.33522  0.80524
max    -0.01155      0.36956  0.97155

Covariate effect sizes for H0
-----------------------------
 0
0.04828

The above example prints the estimated p-value, effect size, and standard error of the effect size of each variant. It also shows a summary of the result by printing the variable model, an instance of the limix.qtl.QTLModel class.

A generalised linear mixed model (GLMM) [McC89] [McC11] is an extension of an LMM that allows for residual errors distributed according to an exponential-family distribution [ExFam]. Let us replace \(\mathbf y\) in the LMM equation with \(\mathbf z\), and define the outcome vector as

\[y_i ~|~ z_i \sim \text{ExpFam}(\mu_i = g(z_i)).\]

The vector \(\mathbf z\), which follows a multivariate Normal distribution, is considered a latent (unobserved) variable. The \(\mu_i\) variable is the parameter defining the expected value of a distribution \(\text{ExpFam}(\cdot)\). It is defined via a link function \(g(\cdot)\), which converts the interval of \(z_i\) (real numbers) to the appropriate interval for \(\mu_i\).

The following example applies limix.qtl.scan() to perform five likelihood ratio tests for association with an outcome vector y having residual errors that follow a Poisson distribution. The matrix G defines both the five alternative hypotheses (the first five columns) and the covariance matrix (the remaining columns).

>>> from numpy import dot, exp, sqrt
>>> from numpy.random import RandomState
>>> from limix.qtl import scan
>>>
>>> random = RandomState(0)
>>>
>>> G = random.randn(25, 50) / sqrt(50)
>>> beta = 0.01 * random.randn(50)
>>>
>>> z = dot(G, beta) + 0.1 * random.randn(25)
>>> z += dot(G[:, 0], 1) # causal SNP
>>>
>>> y = random.poisson(exp(z))
>>>
>>> candidates = G[:, :5]
>>> K = dot(G[:, 5:], G[:, 5:].T)
>>> model = scan(candidates, y, 'poisson', K, verbose=False)
>>>
>>> print(model.variant_pvalues.to_dataframe()) 
                pv
candidate
0          0.19819
1          0.44134
2          0.47341
3          0.21548
4          0.70666
>>> print(model.variant_effsizes.to_dataframe()) 
           effsizes
candidate
0           1.69168
1          -1.00863
2          -1.24902
3           2.04198
4          -0.50974
>>> print(model.variant_effsizes_se.to_dataframe()) 
           effsizes std
candidate
0               1.31470
1               1.31003
2               1.74216
3               1.64859
4               1.3544
>>> print(model) 
Variants
--------
       effsizes  effsizes_se  pvalues
count   5.00000      5.00000  5.00000
mean    0.19325      1.47399  0.40702
std     1.55579      0.20552  0.20956
min    -1.24902      1.31003  0.19819
25%    -1.00863      1.31470  0.21548
50%    -0.50974      1.35444  0.44134
75%     1.69168      1.64859  0.47341
max     2.04198      1.74216  0.70666

Covariate effect sizes for H0
-----------------------------
 0
   -0.00042

Interface

limix.qtl.scan(G, y, lik, K=None, M=None, verbose=True)[source]

Single-variant association testing via generalised linear mixed models.

It supports Normal (linear mixed model), Bernoulli, Probit, Binomial, and Poisson residual errors, defined by lik. The columns of G define the candidates to be tested for association with the phenotype y. The covariance matrix is set by K. If not provided, or set to None, a generalised linear model without random effects is assumed. The covariates can be set via the parameter M. We recommend always providing a column of ones when covariates are provided.

Parameters:
  • G (array_like) – \(N\) individuals by \(S\) candidate markers.
  • y (tuple, array_like) – Either a tuple of two arrays of \(N\) individuals each (Binomial phenotypes) or an array of \(N\) individuals (Normal, Poisson, Bernoulli, or Probit phenotypes).
  • lik ("normal", "bernoulli", "probit", "binomial", "poisson") – Sample likelihood describing the residual distribution.
  • K (array_like, optional) – \(N\)-by-\(N\) covariance matrix (e.g., kinship coefficients). Set to None for a generalised linear model without random effects. Defaults to None.
  • M (array_like, optional) – \(N\) individuals by \(S\) covariates. It will create an \(N\)-by-\(1\) matrix M of ones representing the offset covariate if None is passed. If an array is passed, it will be used as is. Defaults to None.
  • verbose (bool, optional) – True to display progress and summary; False otherwise.
Returns:

QTL representation.

Return type:

limix.qtl.QTLModel

Examples

>>> from numpy import dot, exp, sqrt, ones
>>> from numpy.random import RandomState
>>> from pandas import DataFrame
>>> import pandas as pd
>>> from limix.qtl import scan
>>>
>>> random = RandomState(1)
>>> pd.options.display.float_format = "{:9.6f}".format
>>>
>>> n = 30
>>> p = 3
>>> samples_index = range(n)
>>>
>>> M = DataFrame(dict(offset=ones(n), age=random.randint(10, 60, n)))
>>> M.index = samples_index
>>>
>>> X = random.randn(n, 100)
>>> K = dot(X, X.T)
>>>
>>> candidates = random.randn(n, p)
>>> candidates = DataFrame(candidates, index=samples_index,
...                                    columns=['rs0', 'rs1', 'rs2'])
>>>
>>> y = random.poisson(exp(random.randn(n)))
>>>
>>> model = scan(candidates, y, 'poisson', K, M=M, verbose=False)
>>>
>>> model.variant_pvalues.to_dataframe()  
                 pv
candidate
rs0        0.554444
rs1        0.218996
rs2        0.552200
>>> model.variant_effsizes.to_dataframe()  
           effsizes
candidate
rs0       -0.130867
rs1       -0.315078
rs2       -0.143869
>>> model.variant_effsizes_se.to_dataframe()  
           effsizes std
candidate
rs0            0.221390
rs1            0.256327
rs2            0.242013
>>> model  
Variants
--------
       effsizes  effsizes_se   pvalues
count  3.000000     3.000000  3.000000
mean  -0.196604     0.239910  0.441880
std    0.102807     0.017563  0.193027
min   -0.315077     0.221389  0.218996
25%   -0.229473     0.231701  0.385598
50%   -0.143869     0.242013  0.552200
75%   -0.137367     0.249170  0.553322
max   -0.130866     0.256326  0.554443

Covariate effect sizes for H0
-----------------------------
      age    offset
-0.005568  0.395287

Notes

It will raise a ValueError exception if non-finite values are passed. Please, refer to the limix.qc.mean_impute() function for missing value imputation.

class limix.qtl.QTLModel(null_lml, alt_lmls, effsizes, null_covariate_effsizes)[source]

Result of a QTL analysis.

An instance of this class is returned by limix.qtl.scan().

alt_lmls

Log of the marginal likelihoods across tested variants.

Returns:Log of marginal likelihoods.
Return type:array_like
null_covariate_effsizes

Covariate effect-sizes under the null hypothesis.

Returns:Estimated covariate effect sizes under the null hypothesis.
Return type:array_like
null_lml

Log of the marginal likelihood under the null hypothesis.

Returns:Log of marginal likelihood.
Return type:float
variant_effsizes

Variant effect-sizes.

Returns:Estimated variant effect sizes.
Return type:array_like
variant_effsizes_se

Standard errors of the variant effect sizes.

Returns:Estimated standard errors of the variant effect sizes.
Return type:array_like
variant_pvalues

Variant p-values.

Returns:Association significance between variant and phenotype.
Return type:array_like

References

[McC89] McCullagh, Peter, and John A. Nelder. Generalized linear models. Vol. 37. CRC Press, 1989.
[McC11] McCulloch, Charles E., and Shayle R. Searle. Generalized, linear, and mixed models. John Wiley & Sons, 2004.
[ExFam] Wikipedia contributors. (2018, June 29). Exponential family. In Wikipedia, The Free Encyclopedia. Retrieved 13:47, July 26, 2018, from https://en.wikipedia.org/w/index.php?title=Exponential_family&oldid=848114709
[LRT] Wikipedia contributors. (2018, June 6). Likelihood-ratio test. In Wikipedia, The Free Encyclopedia. Retrieved 13:50, July 26, 2018, from https://en.wikipedia.org/w/index.php?title=Likelihood-ratio_test&oldid=844734768

Shell utilities

Statistics

Principal component analysis

limix.stats.pca(X, ncomp)[source]

Principal component analysis.

Parameters:
  • X (array_like) – Data.
  • ncomp (int) – Number of components.
Returns:

  • components (array_like): first components ordered by explained variance.
  • explained_variance (array_like): explained variance.
  • explained_variance_ratio (array_like): percentage of variance explained.

Return type:

dict

Examples

>>> from numpy import round
>>> from numpy.random import RandomState
>>> from limix.stats import pca
>>>
>>> X = RandomState(1).randn(4, 5)
>>> r = pca(X, ncomp=2)
>>> round(r['components'], 2)
array([[-0.75,  0.58, -0.08,  0.2 , -0.23],
       [ 0.49,  0.72,  0.02, -0.46, -0.16]])
>>> round(r['explained_variance'], 4) 
array([ 6.4466,  0.5145])
>>> round(r['explained_variance_ratio'], 4) 
array([ 0.9205,  0.0735])

P-value correction

limix.stats.multipletests(pvals, alpha=0.05, method='hs', is_sorted=False)[source]

Test results and p-value correction for multiple tests.

Parameters:
  • pvals (array_like) – Uncorrected p-values.
  • alpha (float) – FWER, family-wise error rate, e.g. 0.1.
  • method (string) –

    Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are

    `bonferroni` : one-step correction
    `sidak` : one-step correction
    `holm-sidak` : step down method using Sidak adjustments
    `holm` : step-down method using Bonferroni adjustments
    `simes-hochberg` : step-up method  (independent)
    `hommel` : closed method based on Simes tests (non-negative)
    `fdr_bh` : Benjamini/Hochberg  (non-negative)
    `fdr_by` : Benjamini/Yekutieli (negative)
    `fdr_tsbh` : two stage fdr correction (non-negative)
    `fdr_tsbky` : two stage fdr correction (non-negative)
    
  • is_sorted (bool) – If False (default), the p-values will be sorted, but the corrected p-values are in the original order. If True, it is assumed that the p-values are already sorted in ascending order.
Returns:

  • reject (array_like, boolean) – True for hypotheses that can be rejected at the given alpha.
  • pvals_corrected (array_like) – P-values corrected for multiple tests.
  • alphacSidak (float) – corrected alpha for Sidak method.
  • alphacBonf (float) – corrected alpha for Bonferroni method.

Notes

This is a wrapper around a function from the statsmodels package.
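
Examples

A minimal sketch; the four return values are assumed to unpack in the order listed above (illustrative p-values):

>>> from limix.stats import multipletests
>>>
>>> pvals = [0.01, 0.02, 0.03, 0.20, 0.50]
>>> reject, pvals_corrected, alphac_sidak, alphac_bonf = multipletests(
...     pvals, alpha=0.05, method="fdr_bh")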

limix.stats.empirical_pvalues(xt, x0)[source]

Function to compute empirical p-values.

Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).

Parameters:
  • xt (array_like) – Test statistics observed on the data.
  • x0 (array_like) – Null test statistics. The minimum p-value that can be estimated is 1./float(len(x0)).
Returns:

Estimated empirical p-values.

Return type:

array_like

Examples

>>> from numpy.random import RandomState
>>> from limix.stats import empirical_pvalues
>>>
>>> random = RandomState(1)
>>> x0 = random.chisquare(1, 5)
>>> x1 = random.chisquare(1, 10000)
>>>
>>> empirical_pvalues(x0, x1) 
array([0.563 , 1.    , 0.839 , 0.7982, 0.5803])

Ground truth evaluation

limix.stats.confusion_matrix(df, wsize=50000)[source]

Provide a couple of scores based on the idea of windows around genetic markers.

Parameters:
  • causals – Indices defining the causal markers.
  • pos – Within-chromosome base-pair position of each candidate marker, in ascending order.

Kinship

limix.stats.linear_kinship(G, out=None, verbose=True)[source]

Estimate Kinship matrix via linear kernel.

Examples

>>> from numpy.random import RandomState
>>> from numpy import array_str
>>> from limix.stats import linear_kinship
>>>
>>> random = RandomState(1)
>>> X = random.randn(4, 100)
>>> K = linear_kinship(X, verbose=False)
>>> print(array_str(K, precision=4))
[[ 0.9131 -0.1928 -0.3413 -0.379 ]
 [-0.1928  0.8989 -0.2356 -0.4704]
 [-0.3413 -0.2356  0.9578 -0.3808]
 [-0.379  -0.4704 -0.3808  1.2302]]

Likelihood-ratio test

limix.stats.effsizes_se(effsizes, pvalues)[source]

Standard errors of the effect sizes.

Parameters:
  • effsizes (array_like) – Effect sizes.
  • pvalues (array_like) – Association significance corresponding to those effect sizes.
Returns:

Standard errors of the effect sizes.

Return type:

array_like
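
Examples

A minimal sketch based only on the signature above (illustrative values):

>>> from limix.stats import effsizes_se
>>>
>>> effsizes = [-0.10, 0.25, 0.05]
>>> pvalues = [0.32, 0.01, 0.78]
>>> se = effsizes_se(effsizes, pvalues)  # one standard error per effect size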

limix.stats.lrt_pvalues(null_lml, alt_lmls, dof=1)[source]

Compute p-values from likelihood ratios.

These are likelihood ratio test p-values.

Parameters:
  • null_lml (float) – Log of the marginal likelihood under the null hypothesis.
  • alt_lmls (array_like) – Log of the marginal likelihoods under the alternative hypotheses.
  • dof (int) – Degrees of freedom.
Returns:

P-values.

Return type:

array_like
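
Examples

A minimal sketch based only on the signature above (the log marginal likelihoods are illustrative):

>>> from limix.stats import lrt_pvalues
>>>
>>> null_lml = -105.2                    # null hypothesis
>>> alt_lmls = [-104.9, -98.7, -101.3]   # one per alternative hypothesis
>>> pv = lrt_pvalues(null_lml, alt_lmls, dof=1)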

class limix.stats.Chi2Mixture(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0)[source]

A class for a continuous random variable following a chi2 mixture.

Class for evaluation of p-values for a test statistic that follows a two-component mixture of chi2

\[(1-\pi)\chi^2(0) + \pi a \chi^2(d).\]

Here \(\pi\) is the probability of being in the first component and \(a\) and \(d\) are the scale parameter and the number of degrees of freedom of the second component.

Parameters:
  • scale_min (float) – Minimum value used for fitting the scale parameter.
  • scale_max (float) – Maximum value used for fitting the scale parameter.
  • dof_min (float) – Minimum value used for fitting the dof parameter.
  • dof_max (float) – Maximum value used for fitting the dof parameter.
  • qmax (float) – Only the top qmax quantile is used for the fit.
  • n_intervals (int) – Number of intervals when performing gridsearch.
  • tol (float) – Tolerance of being zero.

Examples

>>> from numpy.random import RandomState
>>> import scipy as sp
>>> from limix.stats import Chi2Mixture
>>>
>>> scale = 0.3
>>> dof = 2
>>> mixture = 0.2
>>> n = 100
>>>
>>> random = RandomState(1)
>>> x =  random.chisquare(dof, n)
>>> n0 = int( (1-mixture) * n)
>>> idxs = random.choice(n, n0, replace=False)
>>> x[idxs] = 0
>>>
>>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0,
...                       dof_min=0.1, dof_max=5.0,
...                       qmax=0.1, tol=4e-3)
>>> chi2mix.estimate_chi2mixture(x)
>>> pv = chi2mix.sf(x)
>>> print(pv[:4]) 
[0.2 0.2 0.2 0.2]
>>>
>>> print('%.2f' % chi2mix.scale)
1.98
>>> print('%.2f' % chi2mix.dof)
0.89
>>> print('%.2f' % chi2mix.mixture)
0.20
estimate_chi2mixture(lrt)[source]

Estimates the parameters of a chi2 mixture.

Estimates the parameters of a chi2 mixture by fitting the empirical distribution of the null test statistics.

Parameters:lrt (array_like) – Null test statistics.
sf(lrt)[source]

Computes the p-values from test statistics lrt.

Parameters:lrt (array_like) – Test statistics.
Returns:P-values.
Return type:array_like
limix.stats.allele_frequency(expec)[source]

Compute allele frequency from its expectation.

Parameters:expec (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
Returns:Allele frequencies encoded as a variants-by-alleles matrix.
Return type:numpy.ndarray
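
Examples

A minimal sketch, assuming expec follows the variants-by-samples-by-alleles layout described above (illustrative diploid expectations):

>>> from numpy.random import RandomState
>>> from limix.stats import allele_frequency
>>>
>>> random = RandomState(0)
>>> expec = 2 * random.dirichlet([1, 1], size=(3, 10))  # 3 variants, 10 samples, 2 alleles
>>> freq = allele_frequency(expec)                      # variants-by-alleles matrix
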
limix.stats.compute_dosage(expec, alt=None)[source]

Compute dosage from allele expectation.

Parameters:
  • expec (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
  • ref (array_like) – Allele reference of each locus. The allele having the minor allele frequency for the provided expec is used as the reference if None. Defaults to None.
Returns:

Dosage encoded as a variants-by-samples matrix.

Return type:

numpy.ndarray

Examples


>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>> from bgen_reader import compute_dosage
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...     e = allele_expectation(bgen["genotype"], nalleles=2, ploidy=2)
...     dosage = compute_dosage(e).compute()
...     print(dosage.shape)
...     print(dosage)
(199, 500)
[[       nan 1.93575854 1.91558579 ... 1.94351192 0.10894776 1.01101689]
 [1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316]
 [0.01550294 0.99383543 1.97933958 ... 1.98681641 1.99041748 1.99603272]
 ...
 [1.99319479 1.980896   1.98767124 ... 1.9943846  1.99716186 1.98712159]
 [0.01263467 0.09661863 0.00869752 ... 0.00643921 0.00494384 0.01504517]
 [0.99185182 1.94860838 0.99734497 ... 0.02914425 1.97827146 0.9515991 ]]
limix.stats.allele_expectation(p, nalleles, ploidy)[source]

Allele expectation.

Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices:
  • unidimensional array of probabilities;
  • bidimensional samples-by-alleles probabilities array;
  • three-dimensional variants-by-samples-by-alleles array.

Parameters:
  • p (array_like) – Allele probabilities.
  • nalleles (int) – Number of alleles.
  • ploidy (int) – Number of complete sets of chromosomes.
Returns:

Last dimension will contain the expectation of each allele.

Return type:

numpy.ndarray

Examples


>>> from texttable import Texttable
>>> from bgen_reader import read_bgen, allele_expectation, example_files
>>>
>>> sampleid = "sample_005"
>>> rsid = "RSID_6"
>>>
>>> with example_files("example.32bits.bgen") as filepath:
...     bgen = read_bgen(filepath, verbose=False)
...
...     locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index
...     sample = bgen["samples"].query("id == '{}'".format(sampleid)).index
...
...     nalleles = bgen["variants"].loc[locus, "nalleles"].item()
...     ploidy = 2
...
...     p = bgen["genotype"][locus[0], sample[0]].compute()
...     # For unphased genotypes only.
...     e = allele_expectation(bgen["genotype"][locus[0], sample[0]], nalleles, ploidy)
...
...     alleles = bgen["variants"].loc[locus, "allele_ids"].item().split(",")
...
...     tab = Texttable()
...
...     tab.add_rows(
...         [
...             ["", "AA", "AG", "GG", "E[.]"],
...             ["p"] + list(p) + [1.0],
...             ["#" + alleles[0], 2, 1, 0, e[0]],
...             ["#" + alleles[1], 0, 1, 2, e[1]],
...         ]
...     )
>>> print(tab.draw())
+----+-------+-------+-------+-------+
|    |  AA   |  AG   |  GG   | E[.]  |
+====+=======+=======+=======+=======+
| p  | 0.012 | 0.987 | 0.001 | 1     |
+----+-------+-------+-------+-------+
| #A | 2     | 1     | 0     | 1.011 |
+----+-------+-------+-------+-------+
| #G | 0     | 1     | 2     | 0.989 |
+----+-------+-------+-------+-------+
>>> print("variant: {}".format(rsid))
variant: RSID_6
>>> print("sample : {}".format(sampleid))
sample : sample_005

Note

This function supports unphased genotypes only.

Tutorials

eQTL

This tutorial illustrates the use of limix to analyse expression datasets. We consider gene expression levels from a yeast genetics study with freely available data. This data set spans 109 individuals with 2,956 marker SNPs and expression levels for 5,493 genes in glucose and ethanol growth media, respectively. It is based on the eQTL basics tutorial of limix 1.0, which is now deprecated.

Importing limix

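All steps below assume the top-level package has been imported:

import limix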

Downloading data

We are going to use an HDF5 file containing phenotype and genotype data from a remote repository. Limix provides some handy utilities to perform common command line tasks, such as downloading and extracting files. However, feel free to use whatever method you prefer.

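A sketch of this step using the shell utilities listed in the API reference below; the file name is a hypothetical stand-in for the actual dataset archive:

import limix

# hypothetical archive name for the yeast expression dataset
url = "http://rest.s3for.me/limix/smith08.hdf5.bz2"
limix.sh.download(url, verbose=False)
limix.sh.extract("smith08.hdf5.bz2", verbose=False)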

Selecting gene YBR115C under the glucose condition

Query for a specific phenotype, select the phenotype itself, and plot it. The glucose condition is given by the environment 0.

header = data['phenotype']['col_header']
query = "gene_ID=='YBR115C' and environment==0"
idx = header.query(query).i.values
y = data['phenotype']['matrix'][:, idx].ravel()


Genetic relatedness matrix

The genetic relatedness will be determined by the inner-product of SNP readings between individuals, and the result will be visualised via heatmap.

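A sketch of this step; the random matrix below is only a stand-in for the samples-by-SNPs genotype readings of the dataset (109 individuals, 2,956 SNPs), and the plotting calls come from the Plotting & Graphics API listed later:

from numpy.random import RandomState
import limix

random = RandomState(0)
G = random.randn(109, 2956)   # stand-in for the SNP readings
K = G.dot(G.T)                # inner product between individuals
limix.plot.kinship(K)         # heatmap of the relatedness matrix
limix.plot.show()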

Univariate association test with linear mixed model

You have the option to either pass a raw array of samples-by-candidates for the association scan or pass a tabular structure naming those candidates. We recommend the second option as it will help maintain the association between the results and the corresponding candidates.

The naming of those candidates is defined here by concatenating the chromosome name and base-pair position. However, it is often the case that SNP IDs are provided along with the data, which can naturally be used for naming those candidates.

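A sketch of how such a tabular structure might be built; the positions reuse the illustrative SNP names shown in the CSV preview earlier, and the allele readings are random stand-ins:

from numpy.random import RandomState
from pandas import DataFrame

random = RandomState(0)
chrom = ["22", "22", "22"]
pos = [16050408, 16050612, 16050678]                  # illustrative positions
names = ["{}_{}".format(c, p) for c, p in zip(chrom, pos)]
snps = (random.rand(109, 3) < 0.3).astype(float)      # stand-in allele readings
G = DataFrame(snps, columns=names)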

As you can see, we now have a pandas data frame G that keeps the candidate identifications together with the actual allele readings. This data frame can readily be used to perform the association scan.


Inspecting the p-values and effect sizes is now easier because candidate names are kept together with their corresponding statistics.


A Manhattan plot can help understand the result.


We then remove the temporary files.


Variance decomposition

Limix enables flexible fitting of variance component models. Here, we illustrate the usage of variance component models fit to single genes. It is based on the eQTL basics tutorial of limix 1.0, which is now deprecated.

Importing limix


The Genetic Model

The model considered by the LIMIX variance decomposition module is an extension of the genetic model employed in standard GWAS; however, it considers multiple random-effect terms:

\[\mathbf{y} = \mathbf{F}\boldsymbol{\alpha} + \sum_{x}\mathbf{u}^{(x)} + \boldsymbol{\psi},\;\;\;\; \mathbf{u}^{(x)}\sim\mathcal{N} \left(\mathbf{0},{\sigma^{(x)}}^2\mathbf{R}^{(x)}\right),\; \boldsymbol{\psi}\sim\mathcal{N}\left(\mathbf{0},\sigma_e^2\mathbf{I}_N\right)\]

where

\[\begin{split}\begin{eqnarray} \mathbf{y} &=& \text{phenotype vector} \in \mathcal{R}^{N,1} \\ \mathbf{F} &=& \text{matrix of $K$ covariates} \in \mathcal{R}^{N,K} \\ \boldsymbol{\alpha} &=& \text{effect of covariates} \in \mathcal{R}^{K,1} \\ \mathbf{R}^{(x)} &=& \text{is the sample covariance matrix for contribution $x$} \in \mathcal{R}^{N,N} \\ \end{eqnarray}\end{split}\]

If \(\mathbf{R}\) is a genetic contribution from set of SNPs \(\mathcal{S}\), with a bit of abuse of notation we can write \(\mathbf{R}= \frac{1}{C}\mathbf{X}_{:,\,\mathcal{S}}{\mathbf{X}_{:,\,\mathcal{S}}}^T\) where \(C=\frac{1}{N}\text{trace}\left(\mathbf{X}_{:,\,\mathcal{S}_i}{\mathbf{X}_{:,\,\mathcal{S}_i}}^T\right)\).

Limix supports an arbitrary number of fixed or random effects to be included in the model.

Example: cis/trans Variance Decomposition in the glucose condition

Here we use the LIMIX variance decomposition module to quantify the variability in gene expression explained by proximal (cis) and distal (trans) genetic variation. To do so, we build a linear mixed model with a fixed effect intercept, two random effects for cis and trans genetic effects and a noise random effect:

\[\begin{split}\mathbf{y} = \mathbf{1}_N\mu + \mathbf{u}^{(cis)} + \mathbf{u}^{(trans)} + \boldsymbol{\psi},\;\;\;\; \mathbf{u}^{(cis)}\sim\mathcal{N} \left(\mathbf{0},{\sigma^{(cis)}}^2\mathbf{R}^{(cis)}\right), \\ \mathbf{u}^{(trans)}\sim \mathcal{N}\left(\mathbf{0},{\sigma^{(trans)}}^2\mathbf{R}^{(trans)}\right), \boldsymbol{\psi}\sim \mathcal{N}\left(\mathbf{0},\sigma_e^2\mathbf{I}_N\right)\end{split}\]

where \(\mathbf{R}^\text{(cis)}\) and \(\mathbf{R}^\text{(trans)}\) are the local and distal relatedness matrices, built considering all SNPs in cis and in trans (i.e., not in cis), respectively. The cis region is defined as the 50kb region around each gene.

The model is fitted to gene expression in environment 0 for all genes in the Lysine Biosynthesis pathway, and variance components are averaged thereafter to obtain pathway-based variance components.


We then remove the temporary files.


API reference

I/O module

limix.io.bgen.read(filepath[, size, …]) Read a given BGEN file.
limix.io.bimbam.read_phenotype(filepath) Read a BIMBAM phenotype file.
limix.io.bimbam.see_phenotype(filepath) Shows a summary of a BIMBAM phenotype file.
limix.io.csv.read(filename[, sep, header]) Read a CSV file.
limix.io.csv.see(filepath, header[, verbose]) Shows a human-friendly representation of a CSV file.
limix.io.gen.read(prefix)
limix.io.hdf5.fetch(fp, path) Fetches an array from hdf5 file.
limix.io.hdf5.fetcher(filename) Fetch datasets from HDF5 files.
limix.io.hdf5.read_limix(filepath) Read the HDF5 limix file format.
limix.io.hdf5.see(f_or_filepath[, …]) Shows a human-friendly tree representation of the contents of a hdf5 file.
limix.io.npy.read(filepath[, verbose])
limix.io.npy.save(filepath, X[, verbose])
limix.io.npy.see(filepath[, verbose])
limix.io.plink.read(prefix[, verbose]) Read PLINK files into Pandas data frames.
limix.io.plink.see_bed(filepath, verbose)
limix.io.plink.see_kinship(filepath, verbose)
limix.io.plink.fetch_dosage(prefix, verbose)

Quality control

limix.qc.boxcox(x) Box Cox transformation for normality conformance.
limix.qc.compute_maf(X) Compute minor allele frequencies.
limix.qc.count_missingness(X) Count the number of missing values per column.
limix.qc.indep_pairwise(X, window_size, …) Determine pair-wise independent variants.
limix.qc.mean_impute(X) Column-wise impute NaN values by column mean.
limix.qc.mean_standardize(X[, axis, out]) Zero-mean and one-deviation normalisation.
limix.qc.normalise_covariance(K[, out]) Variance rescaling of covariance matrix K.
limix.qc.quantile_gaussianize(x) Normalize a sequence of values via rank and Normal c.d.f.
limix.qc.regress_out(Y, X[, return_b]) Regresses out X from Y
limix.qc.remove_dependent_cols(X[, tol, verbose]) Remove dependent columns.
limix.qc.unique_variants(X) Filters out variants with the same genetic profile.

Statistics

limix.stats.allele_expectation(p, nalleles, …) Allele expectation.
limix.stats.allele_frequency(expec) Compute allele frequency from its expectation.
limix.stats.Chi2Mixture([scale_min, …]) A class for continuous random variable following a chi2 mixture.
limix.stats.compute_dosage(expec[, alt]) Compute dosage from allele expectation.
limix.stats.confusion_matrix(df[, wsize]) Provide a couple of scores based on the idea of windows around genetic markers.
limix.stats.effsizes_se(effsizes, pvalues) Standard errors of the effect sizes.
limix.stats.empirical_pvalues(xt, x0) Function to compute empirical p-values.
limix.stats.linear_kinship(G[, out, verbose]) Estimate Kinship matrix via linear kernel.
limix.stats.lrt_pvalues(null_lml, alt_lmls) Compute p-values from likelihood ratios.
limix.stats.multipletests(pvals[, alpha, …]) Test results and p-value correction for multiple tests.
limix.stats.pca(X, ncomp) Principal component analysis.

Heritability estimation

limix.her.estimate(y, lik, K[, M, verbose]) Estimate the so-called narrow-sense heritability.

Quantitative trait loci

limix.qtl.scan(G, y, lik[, K, M, verbose]) Single-variant association testing via generalised linear mixed models.
limix.qtl.QTLModel(null_lml, alt_lmls, …) Result of a QTL analysis.

Plotting & Graphics

limix.plot.box_aspect([ax]) Change to box aspect considering the plotted points.
limix.plot.ConsensusCurve() Consolidate multiple curves in a single one.
limix.plot.image(file[, ax]) Show an image.
limix.plot.kinship(K[, nclusters, img_kws, ax]) Plot heatmap of a kinship matrix.
limix.plot.load_dataset(name) Example datasets.
limix.plot.manhattan(data[, colora, colorb, …]) Produce a manhattan plot.
limix.plot.normal(x[, bins, nstd, ax]) Plot a fit of a normal distribution to the data in x.
limix.plot.pca(X[, pts_kws, ax]) Plot the first two principal components of a design matrix.
limix.plot.power(pv[, label, alphas, …]) Plot number of hits across significance levels.
limix.plot.qqplot(a[, label, alpha, cutoff, …]) Quantile-Quantile plot of observed p-values versus theoretical ones.
limix.plot.get_pyplot()
limix.plot.show()

Generalised Linear Mixed Models

limix.glmm.GLMMComposer.covariance_matrices Get the covariance matrices.
limix.glmm.GLMMComposer.decomp() Get the fixed and random effects.
limix.glmm.GLMMComposer.fit([verbose]) Fit the model.
limix.glmm.GLMMComposer.fixed_effects Get the fixed effects.
limix.glmm.GLMMComposer.likname Get likelihood name.
limix.glmm.GLMMComposer.lml() Get the log of the marginal likelihood.
limix.glmm.GLMMComposer.y Get the outcome array.

Shell utilities

limix.sh.filehash(filepath) Compute sha256 from a given file.
limix.sh.download(url[, dest, verbose, force])
limix.sh.extract(filepath[, verbose])
limix.sh.remove(filepath)

Developers

Data type

limix._dataset._normalise_dataset(y, M=None, G=None, K=None)[source]

Convert data types to DataArray.

This is a fundamental function for limix as it standardises outcome, covariates, candidates, and kinship arrays into the xarray.DataArray data type. Data arrays are numpy/dask arrays with indexed coordinates, therefore generalising data frames from pandas. It allows for lazy loading of data via dask arrays. It also supports arrays with different dimensionality and types, mixture of indexed and non-indexed arrays, and repeated sample labels.

Examples

>>> from __future__ import unicode_literals
>>> import pytest
>>> from numpy.random import RandomState
>>> from pandas import DataFrame
>>> from xarray import DataArray
>>> from limix._dataset import _normalise_dataset
>>>
>>> random = RandomState(0)
>>>
>>> y = random.randn(4)
>>> y = DataFrame(y, index=["sample0", "sample0", "sample1", "sample2"])
>>>
>>> G = random.randn(5, 6)
>>>
>>> data = _normalise_dataset(y, G=G)
>>> print(data["y"])
<xarray.DataArray 'outcome' (sample: 4, trait: 1)>
array([[1.764052],
       [0.400157],
       [0.978738],
       [2.240893]])
Coordinates:
  * sample   (sample) object 'sample0' 'sample0' 'sample1' 'sample2'
  * trait    (trait) int64 0
>>> print(data["G"])
<xarray.DataArray 'candidates' (sample: 4, candidate: 6)>
array([[ 1.867558, -0.977278,  0.950088, -0.151357, -0.103219,  0.410599],
       [ 0.144044,  1.454274,  0.761038,  0.121675,  0.443863,  0.333674],
       [ 1.494079, -0.205158,  0.313068, -0.854096, -2.55299 ,  0.653619],
       [ 0.864436, -0.742165,  2.269755, -1.454366,  0.045759, -0.187184]])
Coordinates:
  * sample   (sample) object 'sample0' 'sample0' 'sample1' 'sample2'
Dimensions without coordinates: candidate
>>> K = random.randn(3, 3)
>>> K = K.dot(K.T)
>>> K = DataArray(K)
>>> K.coords["dim_0"] = ["sample0", "sample1", "sample2"]
>>> K.coords["dim_1"] = ["sample0", "sample1", "sample2"]
>>>
>>> data = _normalise_dataset(y, K=K)
>>> print(data["y"])
<xarray.DataArray 'outcome' (sample: 4, trait: 1)>
array([[1.764052],
       [0.400157],
       [0.978738],
       [2.240893]])
Coordinates:
  * sample   (sample) object 'sample0' 'sample0' 'sample1' 'sample2'
  * trait    (trait) int64 0
>>> print(data["K"])
<xarray.DataArray 'variance-covariance' (sample_0: 4, sample_1: 4)>
array([[ 1.659103,  1.659103, -0.850801, -1.956422],
       [ 1.659103,  1.659103, -0.850801, -1.956422],
       [-0.850801, -0.850801,  1.687126, -0.194938],
       [-1.956422, -1.956422, -0.194938,  6.027272]])
Coordinates:
  * sample_0  (sample_0) <U7 'sample0' 'sample0' 'sample1' 'sample2'
  * sample_1  (sample_1) <U7 'sample0' 'sample0' 'sample1' 'sample2'
>>> with pytest.raises(ValueError):
...     _normalise_dataset(y, G=G, K=K)

Comments and bugs

You can get the source code and open issues on GitHub.