Limix’s documentation¶
Date: | Oct 14, 2018 |
---|---|
Version: | 2.0.0a3 |
Installation¶
The development version of limix can be installed on MacOS and Linux via
bash <(curl -fsSL https://raw.githubusercontent.com/limix/limix/2.0.0/install)
Stable versions of limix are installed via conda though
conda install -c conda-forge limix
It will handle all the necessary dependencies and should work for GNU/Linux distributions, MacOS, and Windows.
Command line interface¶
$ limix -q download http://rest.s3for.me/limix/plink_example.tar.gz &\
limix -q download http://rest.s3for.me/limix/small_example.hdf5 &\
limix -q download http://rest.s3for.me/limix/small_example.csv.bz2 &\
limix -q download http://rest.s3for.me/limix/ex0/phenotype.gemma
/home/docs/checkouts/readthedocs.org/user_builds/horta-limix/envs/api/lib/python3.6/site-packages/limix-2.0.0a3-py3.6.egg/limix/sh/_url.py:23: UserWarning: File /home/docs/checkouts/readthedocs.org/user_builds/horta-limix/checkouts/api/doc/_build/small_example.hdf5 already exists.
warnings.warn("File {} already exists.".format(filepath))
Set `force` to `True` in order to overwrite the existing file.
/home/docs/checkouts/readthedocs.org/user_builds/horta-limix/envs/api/lib/python3.6/site-packages/limix-2.0.0a3-py3.6.egg/limix/sh/_url.py:23: UserWarning: File /home/docs/checkouts/readthedocs.org/user_builds/horta-limix/checkouts/api/doc/_build/phenotype.gemma already exists.
warnings.warn("File {} already exists.".format(filepath))
Set `force` to `True` in order to overwrite the existing file.
/home/docs/checkouts/readthedocs.org/user_builds/horta-limix/envs/api/lib/python3.6/site-packages/limix-2.0.0a3-py3.6.egg/limix/sh/_url.py:23: UserWarning: File /home/docs/checkouts/readthedocs.org/user_builds/horta-limix/checkouts/api/doc/_build/small_example.csv.bz2 already exists.
warnings.warn("File {} already exists.".format(filepath))
Set `force` to `True` in order to overwrite the existing file.
/home/docs/checkouts/readthedocs.org/user_builds/horta-limix/envs/api/lib/python3.6/site-packages/limix-2.0.0a3-py3.6.egg/limix/sh/_url.py:23: UserWarning: File /home/docs/checkouts/readthedocs.org/user_builds/horta-limix/checkouts/api/doc/_build/plink_example.tar.gz already exists.
warnings.warn("File {} already exists.".format(filepath))
Set `force` to `True` in order to overwrite the existing file.
$ limix -q extract plink_example.tar.gz &\
limix -q extract small_example.csv.bz2
Introduction¶
Limix now provides a couple of its functionalities via command line.
$ limix --help
Usage: limix [OPTIONS] COMMAND [ARGS]...
Options:
-v, --verbose / -q, --quiet Enable or disable verbose mode.
--version Show the version and exit.
-h, --help Show this message and exit.
Commands:
download Download file from the specified URL.
estimate-kinship Estimate a kinship matrix.
extract Extract a file.
remove Remove a file.
scan Perform genome-wide association scan.
see Show an overview of multiple file types.
You can quickly explore common file types used in genetics, as examples given bellow will demonstrate.
Kinship¶
Heatmap representing a plink kinship matrix. Setup:
limix download http://rest.s3for.me/limix/small_example.grm.raw.bz2
limix extract small_example.grm.raw.bz2
Command:
limix see small_example.grm.raw

Plink BED format¶
A preview of Plink files in BED format can be done via
$ limix see plink_example
Mapping files: 0%| | 0/3 [00:00<?, ?it/s]
Mapping files: 100%|██████████| 3/3 [00:00<00:00, 153.92it/s]
--------------------------------- Samples ---------------------------------
chrom snp cm pos a0 a1 i
0 22 snp_22_18958209 0.0 18958209 A G 0
1 22 snp_22_19597806 0.0 19597806 T C 1
2 22 snp_22_20171368 0.0 20171368 T C 2
3 22 snp_22_20179046 0.0 20179046 T C 3
4 22 snp_22_20828867 0.0 20828867 T C 4
5 22 snp_22_21350645 0.0 21350645 T C 5
6 22 snp_22_21387385 0.0 21387385 A T 6
7 22 snp_22_22061099 0.0 22061099 A G 7
8 22 snp_22_22329747 0.0 22329747 T G 8
9 22 snp_22_22800690 0.0 22800690 A T 9
10 22 snp_22_23106822 0.0 23106822 T C 10
11 22 snp_22_23705439 0.0 23705439 C T 11
12 22 snp_22_23805130 0.0 23805130 C A 12
13 22 snp_22_24677829 0.0 24677829 C T 13
14 22 snp_22_24944782 0.0 24944782 A G 14
15 22 snp_22_25825092 0.0 25825092 A G 15
16 22 snp_22_26247607 0.0 26247607 T C 16
17 22 snp_22_26585094 0.0 26585094 A T 17
18 22 snp_22_26675434 0.0 26675434 A C 18
19 22 indel:1I_22_27387365 0.0 27387365 TA T 19
20 22 snp_22_27520325 0.0 27520325 A T 20
21 22 snp_22_28178514 0.0 28178514 T C 21
22 22 snp_22_29960768 0.0 29960768 G T 22
23 22 snp_22_30253157 0.0 30253157 A G 23
24 22 indel:4D_22_30663957 0.0 30663957 G GCAGA 24
25 22 snp_22_30901592 0.0 30901592 C T 25
26 22 snp_22_30937512 0.0 30937512 G A 26
27 22 snp_22_31024375 0.0 31024375 A C 27
28 22 snp_22_31102820 0.0 31102820 G A 28
29 22 snp_22_31496200 0.0 31496200 T C 29
.. ... ... ... ... ... ... ..
70 22 snp_22_43779140 0.0 43779140 T C 70
71 22 indel:1D_22_43820821 0.0 43820821 C CG 71
72 22 snp_22_44052552 0.0 44052552 C T 72
73 22 snp_22_44162123 0.0 44162123 A G 73
74 22 snp_22_44657401 0.0 44657401 A G 74
75 22 snp_22_44933193 0.0 44933193 C A 75
76 22 snp_22_45136558 0.0 45136558 G A 76
77 22 snp_22_45442509 0.0 45442509 C A 77
78 22 snp_22_46289699 0.0 46289699 C T 78
79 22 snp_22_46650858 0.0 46650858 C A 79
80 22 snp_22_46665209 0.0 46665209 A G 80
81 22 snp_22_46870068 0.0 46870068 T C 81
82 22 snp_22_46938676 0.0 46938676 C T 82
83 22 snp_22_47061834 0.0 47061834 A G 83
84 22 snp_22_47500904 0.0 47500904 T C 84
85 22 snp_22_47586093 0.0 47586093 C T 85
86 22 snp_22_47627719 0.0 47627719 T C 86
87 22 snp_22_47772918 0.0 47772918 C G 87
88 22 indel:3I_22_48207120 0.0 48207120 CCAG C 88
89 22 snp_22_48439843 0.0 48439843 C A 89
90 22 snp_22_48740730 0.0 48740730 T C 90
91 22 indel:16D_22_48777234 0.0 48777234 A AACCCAGGAGAGGATCG 91
92 22 snp_22_48836042 0.0 48836042 G A 92
93 22 snp_22_49010580 0.0 49010580 T C 93
94 22 snp_22_49335866 0.0 49335866 A G 94
95 22 indel:4D_22_49340059 0.0 49340059 G GAGAC 95
96 22 snp_22_49362308 0.0 49362308 C T 96
97 22 snp_22_49473688 0.0 49473688 T C 97
98 22 snp_22_49568955 0.0 49568955 G A 98
99 22 snp_22_50837415 0.0 50837415 A G 99
[100 rows x 7 columns]
------------------- Genotype -------------------
fid iid father mother gender trait i
0 0 HG00105 0 0 0 -9 0
1 0 HG00107 0 0 0 -9 1
2 0 HG00115 0 0 0 -9 2
3 0 HG00132 0 0 0 -9 3
4 0 HG00145 0 0 0 -9 4
5 0 HG00157 0 0 0 -9 5
6 0 HG00181 0 0 0 -9 6
7 0 HG00308 0 0 0 -9 7
8 0 HG00365 0 0 0 -9 8
9 0 HG00371 0 0 0 -9 9
10 0 HG00379 0 0 0 -9 10
11 0 HG00380 0 0 0 -9 11
12 0 HG01789 0 0 0 -9 12
13 0 HG01790 0 0 0 -9 13
14 0 HG01791 0 0 0 -9 14
15 0 HG02215 0 0 0 -9 15
16 0 NA06985 0 0 0 -9 16
17 0 NA07346 0 0 0 -9 17
18 0 NA11832 0 0 0 -9 18
19 0 NA11840 0 0 0 -9 19
20 0 NA11881 0 0 0 -9 20
21 0 NA11918 0 0 0 -9 21
22 0 NA12005 0 0 0 -9 22
23 0 NA12156 0 0 0 -9 23
24 0 NA12234 0 0 0 -9 24
25 0 NA12760 0 0 0 -9 25
26 0 NA12762 0 0 0 -9 26
27 0 NA12776 0 0 0 -9 27
28 0 NA12813 0 0 0 -9 28
29 0 NA18488 0 0 0 -9 29
.. .. ... ... ... ... ... ...
435 0 NA20785 0 0 0 -9 435
436 0 NA20786 0 0 0 -9 436
437 0 NA20787 0 0 0 -9 437
438 0 NA20790 0 0 0 -9 438
439 0 NA20792 0 0 0 -9 439
440 0 NA20795 0 0 0 -9 440
441 0 NA20796 0 0 0 -9 441
442 0 NA20797 0 0 0 -9 442
443 0 NA20798 0 0 0 -9 443
444 0 NA20799 0 0 0 -9 444
445 0 NA20800 0 0 0 -9 445
446 0 NA20801 0 0 0 -9 446
447 0 NA20802 0 0 0 -9 447
448 0 NA20803 0 0 0 -9 448
449 0 NA20804 0 0 0 -9 449
450 0 NA20805 0 0 0 -9 450
451 0 NA20806 0 0 0 -9 451
452 0 NA20807 0 0 0 -9 452
453 0 NA20808 0 0 0 -9 453
454 0 NA20809 0 0 0 -9 454
455 0 NA20810 0 0 0 -9 455
456 0 NA20811 0 0 0 -9 456
457 0 NA20812 0 0 0 -9 457
458 0 NA20813 0 0 0 -9 458
459 0 NA20814 0 0 0 -9 459
460 0 NA20815 0 0 0 -9 460
461 0 NA20816 0 0 0 -9 461
462 0 NA20819 0 0 0 -9 462
463 0 NA20826 0 0 0 -9 463
464 0 NA20828 0 0 0 -9 464
[465 rows x 7 columns]
BIMBAM file formats¶
Phenotype:
$ limix see phenotype.gemma --filetype=bimbam-pheno
0 1 2
0 1.2 -0.3 -1.5
1 NaN 1.5 0.3
2 2.7 1.1 NaN
3 -0.2 -0.7 0.8
4 3.3 2.4 2.1
[5 rows x 3 columns]
HDF5¶
The following command shows the hierarchy of a HDF5 file:
$ limix see small_example.hdf5
/
+--genotype
+--col_header
| +--chrom [|S8, (100,), None]
| +--pos [int64, (100,), None]
+--matrix [uint8, (183, 100), None]
+--row_header
+--sample_ID [|S7, (183,), None]
CSV¶
CSV files have their delimiter automatically detected and a preview can be shown as
$ limix see small_example.csv
Reading small_example.csv... done (0.07 seconds).
0 1 2 3 4 5 6 ... 459 460 461 462 463 464 465
0 snp_22_16050408 A A A A A A ... B B B B B B B
1 snp_22_16050612 A A A A A A ... B B B B B B B
2 snp_22_16050678 A A A A A A ... B B B B B B B
3 snp_22_16051107 A A A A A A ... B B B B B B B
4 snp_22_16051249 A A A A A A ... B B B C C B B
[5 rows x 466 columns]
GWAS¶
$ limix scan --help
Usage: limix scan [OPTIONS] PHENOTYPES_FILE GENOTYPE_FILE
Perform genome-wide association scan.
This analysis requires minimally the specification of one phenotype
(PHENOTYPES_FILE) and genotype data (GENOTYPE_FILE).
The --filter option allows for selecting a subset of the original dataset
for the analysis. For example,
--filter="genotype: (chrom == '3') & (pos > 100) & (pos < 200)"
states that only loci of chromosome 3 having a position inside the range
(100, 200) will be considered. The --filter option can be used multiple
times in the same call. In general, --filter accepts a string of the form
<DATA-TYPE>: <BOOL-EXPR>
where <DATA-TYPE> can be phenotype, genotype, or covariate. <BOOL-EXPR> is
a boolean expression involving row or column names. Please, consult
`pandas.DataFrame.query` function from Pandas package for further
information.
Options:
--covariates-file TEXT Specify the file path to a file containing the
covariates.
--kinship-file TEXT Specify the file path to a file containing the
kinship matrix.
--lik TEXT Specify the type of likelihood that will described
the residual error distribution.
--filter TEXT Filtering expression to select which phenotype,
genotype loci, and covariates to use in the
analysis.
--output-dir TEXT Specify the output directory path.
-h, --help Show this message and exit.
Heritability estimation¶
Heritability¶
-
limix.her.
estimate
(y, lik, K, M=None, verbose=True)[source] Estimate the so-called narrow-sense heritability.
It supports Normal, Bernoulli, Probit, Binomial, and Poisson phenotypes. Let \(N\) be the sample size and \(S\) the number of covariates.
Parameters: - y (array_like) – Either a tuple of two arrays of N individuals each (Binomial
phenotypes) or an array of N individuals (Normal, Poisson, or
Bernoulli phenotypes). If a continuous phenotype is provided (i.e., a Normal
one), make sure they have been normalised in such a way that its values are
not extremely large; it might cause numerical errors otherwise. For example,
by using
limix.qc.mean_standardize()
orlimix.qc.quantile_gaussianize()
. - lik ("normal", "bernoulli", "probit", binomial", "poisson") – Sample likelihood describing the residual distribution.
- K (array_like) – \(N\)-by-\(N\) covariance matrix. It might be, for example, the
estimated kinship relationship between the individuals. The provided matrix will
be normalised via the function
limix.qc.normalise_covariance()
. - M (array_like, optional) – \(N\) individuals by \(S\) covariates.
It will create a \(N\)-by-\(1\) matrix
M
of ones representing the offset covariate ifNone
is passed. If an array is passed, it will used as is. Defaults toNone
. - verbose (bool, optional) –
True
to display progress and summary;False
otherwise.
Returns: Estimated heritability.
Return type: Examples
>>> from numpy import dot, exp, sqrt >>> from numpy.random import RandomState >>> from limix.her import estimate >>> >>> random = RandomState(0) >>> >>> G = random.randn(150, 200) / sqrt(200) >>> K = dot(G, G.T) >>> z = dot(G, random.randn(200)) + random.randn(150) >>> y = random.poisson(exp(z)) >>> >>> print('%.3f' % estimate(y, 'poisson', K, verbose=False)) 0.183
Notes
It will raise a
ValueError
exception if non-finite values are passed. Please, refer to thelimix.qc.mean_impute()
function for missing value imputation.- y (array_like) – Either a tuple of two arrays of N individuals each (Binomial
phenotypes) or an array of N individuals (Normal, Poisson, or
Bernoulli phenotypes). If a continuous phenotype is provided (i.e., a Normal
one), make sure they have been normalised in such a way that its values are
not extremely large; it might cause numerical errors otherwise. For example,
by using
Quality control¶
Box-Cox¶
-
limix.qc.
boxcox
(x)[source] Box Cox transformation for normality conformance.
It applies the power transformation
\[\begin{split}f(x) = \begin{cases} \frac{x^{\lambda} - 1}{\lambda}, & \text{if } \lambda > 0; \\ \log(x), & \text{if } \lambda = 0. \end{cases}\end{split}\]to the provided data, hopefully making it more normal distribution-like. The \(\lambda\) parameter is fit by maximum likelihood estimation.
Parameters: X (array_like) – Data to be transformed. Returns: Box Cox transformed data. Return type: array_like Examples
Dependent columns¶
-
limix.qc.
remove_dependent_cols
(X, tol=1e-06, verbose=False)[source] Remove dependent columns.
Return a matrix with dependent columns removed.
Parameters: X (array_like) – Matrix to might have dependent columns. Returns: Full column rank matrix. Return type: array_like
Genotype¶
-
limix.qc.
indep_pairwise
(X, window_size, step_size, threshold, verbose=True)[source] Determine pair-wise independent variants.
Independent variants are defined via squared Pearson correlations between pairs of variants inside a sliding window.
Parameters: - X (array_like) – Sample by variants matrix.
- window_size (int) – Number of variants inside each window.
- step_size (int) – Number of variants the sliding window skips.
- threshold (float) – Squared Pearson correlation threshold for independence.
- verbose (bool) – True for progress information; False otherwise.
Returns: ok
Return type: boolean array defining independent variants
Examples
>>> from numpy.random import RandomState >>> from limix.qc import indep_pairwise >>> >>> random = RandomState(0) >>> X = random.randn(10, 20) >>> >>> indep_pairwise(X, 4, 2, 0.5, verbose=False) array([ True, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True])
-
limix.qc.
compute_maf
(X)[source] Compute minor allele frequencies.
It assumes that
X
encodes 0, 1, and 2 representing the number of alleles (or dosage), orNaN
to represent missing values.Parameters: X (array_like) – Genotype matrix. Returns: Minor allele frequencies. Return type: array_like Examples
>>> from numpy.random import RandomState >>> from limix.qc import compute_maf >>> >>> random = RandomState(0) >>> X = random.randint(0, 3, size=(100, 10)) >>> >>> print(compute_maf(X)) [0.49 0.49 0.445 0.495 0.5 0.45 0.48 0.48 0.47 0.435]
Impute¶
-
limix.qc.
mean_impute
(X)[source] Column-wise impute
NaN
values by column mean.It works well with Dask array.
Parameters: X (array_like) – Matrix to have its missing values imputed. Returns: Imputed array. Return type: array_like Examples
>>> from numpy.random import RandomState >>> from numpy import nan, array_str >>> from limix.qc import mean_impute >>> >>> random = RandomState(0) >>> X = random.randn(5, 2) >>> X[0, 0] = nan >>> >>> print(array_str(mean_impute(X), precision=4)) [[ 0.9233 0.4002] [ 0.9787 2.2409] [ 1.8676 -0.9773] [ 0.9501 -0.1514] [-0.1032 0.4106]]
-
limix.qc.
count_missingness
(X)[source] Count the number of missing values per column.
Returns: Number of missing values per column. Return type: array_like
Kinship¶
-
limix.qc.
normalise_covariance
(K, out=None)[source] Variance rescaling of covariance matrix
K
.Let \(n\) be the number of rows (or columns) of
K
and let \(m_i\) be the average of the values in the i-th column. Gower rescaling is defined as\[\mathrm K \frac{n - 1}{\text{trace}(\mathrm K) - \sum m_i}.\]It works well with Dask array as log as
out
isNone
.Notes
The reasoning of the scaling is as follows. Let \(\mathbf g\) be a vector of \(n\) independent samples and let \(\mathrm C\) be the Gower’s centering matrix. The unbiased variance estimator is
\[v = \sum_i \frac{(g_i-\overline g)^2}{n-1} =\frac{\mathrm{Tr} [(\mathbf g-\overline g\mathbf 1)^t(\mathbf g-\overline g\mathbf 1)]} {n-1} = \frac{\mathrm{Tr}[\mathrm C\mathbf g\mathbf g^t\mathrm C]}{n-1}\]Let \(\mathrm K\) be the covariance matrix of \(\mathbf g\). The expectation of the unbiased variance estimator is
\[\mathbb E[v] = \frac{\mathrm{Tr}[\mathrm C\mathbb E[\mathbf g\mathbf g^t]\mathrm C]} {n-1} = \frac{\mathrm{Tr}[\mathrm C\mathrm K\mathrm C]}{n-1}\]assuming that \(\mathbb E[g_i]=0\). We thus divide \(\mathrm K\) by \(\mathbb E[v]\) to achieve the desired normalisation.
Parameters: - K (array_like) – Covariance matrix to be normalised.
- out (array_like, optional) – Result destination. Defaults to
None
.
Examples
>>> from numpy import dot, mean, zeros >>> from numpy.random import RandomState >>> from limix.qc import normalise_covariance >>> >>> random = RandomState(0) >>> X = random.randn(10, 10) >>> K = dot(X, X.T) >>> Z = random.multivariate_normal(zeros(10), K, 500) >>> print("%.3f" % mean(Z.var(1, ddof=1))) 9.824 >>> Kn = normalise_covariance(K) >>> Zn = random.multivariate_normal(zeros(10), Kn, 500) >>> print("%.3f" % mean(Zn.var(1, ddof=1))) 1.008
Normalisation¶
-
limix.qc.
mean_standardize
(X, axis=None, out=None)[source] Zero-mean and one-deviation normalisation.
Normalise in such a way that the mean and variance are equal to zero and one. This transformation is taken over the flattened array by default, otherwise over the specified axis. Missing values represented by
NaN
are ignored.It works well with Dask array.
Parameters: Returns: Normalized array.
Return type: array_like
Examples
>>> import limix >>> from numpy import arange, array_str >>> >>> X = arange(15).reshape((5, 3)) >>> print(X) [[ 0 1 2] [ 3 4 5] [ 6 7 8] [ 9 10 11] [12 13 14]] >>> X = limix.qc.mean_standardize(X, axis=0) >>> print(array_str(X, precision=4)) [[-1.4142 -1.4142 -1.4142] [-0.7071 -0.7071 -0.7071] [ 0. 0. 0. ] [ 0.7071 0.7071 0.7071] [ 1.4142 1.4142 1.4142]]
-
limix.qc.
quantile_gaussianize
(x)[source] Normalize a sequence of values via rank and Normal c.d.f.
Parameters: x (array_like) – Sequence of values. Returns: Gaussian-normalized values. Return type: array_like Examples
>>> from limix.qc import quantile_gaussianize >>> from numpy import array_str >>> >>> qg = quantile_gaussianize([-1, 0, 2]) >>> print(array_str(qg, precision=4)) [-0.6745 0. 0.6745]
Quantitative trait locus¶
Introduction¶
A linear mixed model (LMM) can be described as
where \(\mathbf u \sim \mathcal N(\mathbf 0, \sigma_u^2\mathrm I)\) is a vector of random effects and \(\epsilon_i\) are iid Normal random variables with zero-mean and variance \(\sigma_e^2\) each. Covariates are defined by the columns of \(\mathrm M\), and \(\mathrm X\) commonly contain all genetic variants of each sample.
The outcome-vector is thus distributed according to
The parameters \(\boldsymbol\beta_0\), \(\sigma_u\), and \(\sigma_{\epsilon}\) are estimated via the maximum likelihood estimation (MLE) approach under the null hypothesis just defined.
The alternative hypothesis for single-variant testing consists in the addition of a fixed-effect size \(\beta_1\):
The parameter \(\beta_1\) multiplies a given vector \(\mathbf g\), typically representing a genetic locus of interest. The parameters \(\boldsymbol\beta_0\), \(\beta_1\), \(\sigma_u\), and \(\sigma_{\epsilon}\) are estimated via MLE under the alternative hypothesis. The comparison of the two marginal likelihoods learnt under the null and alternative hypotheses allows us to perform a likelihood ratio test [LRT].
We now show how to use limix to perform association tests using
linear mixed models.
The outcome-vector is given by y
.
The covariance matrix is defined by the kinship
variable.
We do not provide any covariate.
In that case, the function limix.qtl.scan()
we call will internally add
a covariate of ones to be multiplied by the offset parameter.
Finally, we pass a matrix candidates
of four columns representing four
alternative hypotheses to be tested:
>>> from numpy.random import RandomState
>>> from numpy import dot
>>> from limix.qtl import scan
>>>
>>> random = RandomState(1)
>>>
>>> n = 25
>>>
>>> candidates = (random.rand(n, 4) < 0.2).astype(float)
>>> y = random.randn(n)
>>> X = random.randn(n, 50)
>>> kinship = dot(X, X.T) / float(25)
>>>
>>> model = scan(candidates, y, 'normal', kinship, verbose=False)
>>> print(model.variant_pvalues.to_dataframe())
pv
candidate
0 0.74981
1 0.00537
2 0.07036
3 0.97155
>>> print(model.variant_effsizes.to_dataframe())
effsizes
candidate
0 -0.09660
1 -1.02874
2 -0.46314
3 -0.01155
>>> print(model.variant_effsizes_se.to_dataframe())
effsizes std
candidate
0 0.30293
1 0.36956
2 0.25594
3 0.32377
>>> print(model)
Variants
--------
effsizes effsizes_se pvalues
count 4.00000 4.00000 4.00000
mean -0.40001 0.31305 0.44927
std 0.46269 0.04716 0.48433
min -1.02874 0.25594 0.00537
25% -0.60454 0.29118 0.05411
50% -0.27987 0.31335 0.41008
75% -0.07534 0.33522 0.80524
max -0.01155 0.36956 0.97155
Covariate effect sizes for H0
-----------------------------
0
0.04828
The above example prints the estimated p-value, effect size, and standard
error of the effect size of each variant.
It also shows a summary of the result by printing the variable model
, an
instance of the limix.qtl.QTLModel
class.
A generalised linear mixed model (GLMM) [McC89] [McC11] in an extension of a LMM that allows for residual errors distributed according to an exponential-family distribution [ExFam]. Let us replace \(\mathbf y\) in the LMM equation by \(\mathbf z\), and define the outcome-vector as
The multivariate Normal distribution \(\mathbf z\) is considered a latent (unobserved) variable. The \(\mu_i\) variable is the parameter defining the expected value of a distribution \(\text{ExpFam}(\cdot)\). It is defined via a link function \(g(\cdot)\), which converts the interval of \(z_i\) (real numbers) to the appropriate interval for \(\mu_i\).
The following example applies limix.qtl.scan()
to perform five
likelihood ratio tests for association with an outcome vector y
having
residual errors that follow a Poisson distribution.
The matrix G
defines both the five alternative hypotheses
(the first five columns) and the covariance matrix (the remaining columns).
>>> from numpy import dot, exp, sqrt
>>> from numpy.random import RandomState
>>> from limix.qtl import scan
>>>
>>> random = RandomState(0)
>>>
>>> G = random.randn(25, 50) / sqrt(50)
>>> beta = 0.01 * random.randn(50)
>>>
>>> z = dot(G, beta) + 0.1 * random.randn(25)
>>> z += dot(G[:, 0], 1) # causal SNP
>>>
>>> y = random.poisson(exp(z))
>>>
>>> candidates = G[:, :5]
>>> K = dot(G[:, 5:], G[:, 5:].T)
>>> model = scan(candidates, y, 'poisson', K, verbose=False)
>>>
>>> print(model.variant_pvalues.to_dataframe())
pv
candidate
0 0.19819
1 0.44134
2 0.47341
3 0.21548
4 0.70666
>>> print(model.variant_effsizes.to_dataframe())
effsizes
candidate
0 1.69168
1 -1.00863
2 -1.24902
3 2.04198
4 -0.50974
>>> print(model.variant_effsizes_se.to_dataframe())
effsizes std
candidate
0 1.31470
1 1.31003
2 1.74216
3 1.64859
4 1.3544
>>> print(model)
Variants
--------
effsizes effsizes_se pvalues
count 5.00000 5.00000 5.00000
mean 0.19325 1.47399 0.40702
std 1.55579 0.20552 0.20956
min -1.24902 1.31003 0.19819
25% -1.00863 1.31470 0.21548
50% -0.50974 1.35444 0.44134
75% 1.69168 1.64859 0.47341
max 2.04198 1.74216 0.70666
Covariate effect sizes for H0
-----------------------------
0
-0.00042
Interface¶
-
limix.qtl.
scan
(G, y, lik, K=None, M=None, verbose=True)[source] Single-variant association testing via generalised linear mixed models.
It supports Normal (linear mixed model), Bernoulli, Probit, Binomial, and Poisson residual errors, defined by
lik
. The columns ofG
define the candidates to be tested for association with the phenotypey
. The covariance matrix is set byK
. If not provided, or set toNone
, the generalised linear model without random effects is assumed. The covariates can be set via the parameterM
. We recommend to always provide a column of ones when covariates are actually provided.Parameters: - G (array_like) – \(N\) individuals by \(S\) candidate markers.
- y (tuple, array_like) – Either a tuple of two arrays of \(N\) individuals each (Binomial phenotypes) or an array of \(N\) individuals (Normal, Poisson, Bernoulli, or Probit phenotypes).
- lik ("normal", "bernoulli", "probit", binomial", "poisson") – Sample likelihood describing the residual distribution.
- K (array_like, optional) – \(N\)-by-\(N\) covariance matrix (e.g., kinship coefficients).
Set to
None
for a generalised linear model without random effects. Defaults toNone
. - M (array_like, optional) – N individuals by S covariates.
It will create a \(N\)-by-\(1\) matrix
M
of ones representing the offset covariate ifNone
is passed. If an array is passed, it will used as is. Defaults toNone
. - verbose (bool, optional) –
True
to display progress and summary;False
otherwise.
Returns: QTL representation.
Return type: Examples
>>> from numpy import dot, exp, sqrt, ones >>> from numpy.random import RandomState >>> from pandas import DataFrame >>> import pandas as pd >>> from limix.qtl import scan >>> >>> random = RandomState(1) >>> pd.options.display.float_format = "{:9.6f}".format >>> >>> n = 30 >>> p = 3 >>> samples_index = range(n) >>> >>> M = DataFrame(dict(offset=ones(n), age=random.randint(10, 60, n))) >>> M.index = samples_index >>> >>> X = random.randn(n, 100) >>> K = dot(X, X.T) >>> >>> candidates = random.randn(n, p) >>> candidates = DataFrame(candidates, index=samples_index, ... columns=['rs0', 'rs1', 'rs2']) >>> >>> y = random.poisson(exp(random.randn(n))) >>> >>> model = scan(candidates, y, 'poisson', K, M=M, verbose=False) >>> >>> model.variant_pvalues.to_dataframe() pv candidate rs0 0.554444 rs1 0.218996 rs2 0.552200 >>> model.variant_effsizes.to_dataframe() effsizes candidate rs0 -0.130867 rs1 -0.315078 rs2 -0.143869 >>> model.variant_effsizes_se.to_dataframe() effsizes std candidate rs0 0.221390 rs1 0.256327 rs2 0.242013 >>> model Variants -------- effsizes effsizes_se pvalues count 3.000000 3.000000 3.000000 mean -0.196604 0.239910 0.441880 std 0.102807 0.017563 0.193027 min -0.315077 0.221389 0.218996 25% -0.229473 0.231701 0.385598 50% -0.143869 0.242013 0.552200 75% -0.137367 0.249170 0.553322 max -0.130866 0.256326 0.554443 Covariate effect sizes for H0 ----------------------------- age offset -0.005568 0.395287
Notes
It will raise a
ValueError
exception if non-finite values are passed. Please, refer to thelimix.qc.mean_impute()
function for missing value imputation.
-
class
limix.qtl.
QTLModel
(null_lml, alt_lmls, effsizes, null_covariate_effsizes)[source] Result of a QTL analysis.
An instance of this class is returned by
limix.qtl.scan()
.-
alt_lmls
Log of the marginal likelihoods across tested variants.
Returns: Log of marginal likelihoods. Return type: array_like
-
null_covariate_effsizes
Covariate effect-sizes under the null hypothesis.
Returns: Estimated covariant effect sizes under the null hypothesis. Return type: array_like
-
null_lml
Log of the marginal likelihood under the null hypothesis.
Returns: Log of marginal likelihood. Return type: float
-
variant_effsizes
Variant effect-sizes.
Returns: Estimated variant effect sizes. Return type: array_like
-
variant_effsizes_se
Standard errors of the variant effect sizes.
Returns: Estimated standard errors of the variant effect sizes. Return type: array_like
-
variant_pvalues
Variant p-values.
Returns: Association significance between variant and phenotype. Return type: array_like
-
References
[McC89] | McCullagh, Peter, and John A. Nelder. Generalized linear models. Vol. 37. CRC press, 1989. |
[McC11] | McCulloch, Charles E., and Shayle R. Searle. Generalized, linear, and mixed models. John Wiley & Sons, 2004. |
[ExFam] | Wikipedia contributors. (2018, June 29). Exponential family. In Wikipedia, The Free Encyclopedia. Retrieved 13:47, July 26, 2018, from https://en.wikipedia.org/w/index.php?title=Exponential_family&oldid=848114709 |
[LRT] | Wikipedia contributors. (2018, June 6). Likelihood-ratio test. In Wikipedia, The Free Encyclopedia. Retrieved 13:50, July 26, 2018, from https://en.wikipedia.org/w/index.php?title=Likelihood-ratio_test&oldid=844734768 |
Shell utilities¶
Statistics¶
Statistics¶
Principal component analysis¶
-
limix.stats.
pca
(X, ncomp)[source] Principal component analysis.
Parameters: - X (array_like) – Data.
- ncomp (int) – Number of components.
Returns: - components (array_like): first components ordered by explained variance.
- explained_variance (array_like): explained variance.
- explained_variance_ratio (array_like): percentage of variance explained.
Return type: Examples
>>> from numpy import round >>> from numpy.random import RandomState >>> from limix.stats import pca >>> >>> X = RandomState(1).randn(4, 5) >>> r = pca(X, ncomp=2) >>> round(r['components'], 2) array([[-0.75, 0.58, -0.08, 0.2 , -0.23], [ 0.49, 0.72, 0.02, -0.46, -0.16]]) >>> round(r['explained_variance'], 4) array([ 6.4466, 0.5145]) >>> round(r['explained_variance_ratio'], 4) array([ 0.9205, 0.0735])
P-value correction¶
-
limix.stats.
multipletests
(pvals, alpha=0.05, method='hs', is_sorted=False)[source] Test results and p-value correction for multiple tests.
Parameters: - pvals (array_like) – Uncorrected p-values.
- alpha (float) – FWER, family-wise error rate, e.g.
0.1
. - method (string) –
Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are
`bonferroni` : one-step correction `sidak` : one-step correction `holm-sidak` : step down method using Sidak adjustments `holm` : step-down method using Bonferroni adjustments `simes-hochberg` : step-up method (independent) `hommel` : closed method based on Simes tests (non-negative) `fdr_bh` : Benjamini/Hochberg (non-negative) `fdr_by` : Benjamini/Yekutieli (negative) `fdr_tsbh` : two stage fdr correction (non-negative) `fdr_tsbky` : two stage fdr correction (non-negative)
- is_sorted (bool) – If
False
(default), the p_values will be sorted, but the corrected pvalues are in the original order. IfTrue
, then it assumed that the pvalues are already sorted in ascending order.
Returns: - reject (array_like, boolean) –
true
for hypothesis that can be rejected for given alpha. - pvals_corrected (array_like) – P-values corrected for multiple tests.
- alphacSidak (float) – corrected alpha for Sidak method.
- alphacBonf (float) – corrected alpha for Bonferroni method.
Notes
This is a wrapper around a function from the statsmodels package.
-
limix.stats.
empirical_pvalues
(xt, x0)[source] Function to compute empirical p-values.
Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).
Parameters: - xt (array_like) – Test statistcs observed on data.
- x0 (array_like) – Null test statistcs. The minimum p-value that can be
estimated is
1./float(len(x0))
.
Returns: Estimated empirical p-values.
Return type: array_like
Examples
>>> from numpy.random import RandomState >>> from limix.stats import empirical_pvalues >>> >>> random = RandomState(1) >>> x0 = random.chisquare(1, 5) >>> x1 = random.chisquare(1, 10000) >>> >>> empirical_pvalues(x0, x1) array([0.563 , 1. , 0.839 , 0.7982, 0.5803])
-
limix.stats.
Chi2Mixture
(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0)[source] A class for continuous random variable following a chi2 mixture.
Class for evaluation of P values for a test statistic that follows a two-component mixture of chi2
\[(1-\pi)\chi^2(0) + \pi a \chi^2(d).\]Here \(\pi\) is the probability being in the first component and \(a\) and \(d\) are the scale parameter and the number of degrees of freedom of the second component.
Parameters: - scale_min (float) – Minimum value used for fitting the scale parameter.
- scale_max (float) – Maximum value used for fitting the scale parameter.
- dofmin (float) – Minimum value used for fitting the dof parameter.
- dofmax (float) – Maximum value used for fitting the dof parameter.
- qmax (float) – Only the top qmax quantile is used for the fit.
- n_interval (int) – Number of intervals when performing gridsearch.
- tol (float) – Tolerance of being zero.
Examples
>>> from numpy.random import RandomState >>> import scipy as sp >>> from limix.stats import Chi2Mixture >>> >>> scale = 0.3 >>> dof = 2 >>> mixture = 0.2 >>> n = 100 >>> >>> random = RandomState(1) >>> x = random.chisquare(dof, n) >>> n0 = int( (1-mixture) * n) >>> idxs = random.choice(n, n0, replace=False) >>> x[idxs] = 0 >>> >>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0, ... dof_min=0.1, dof_max=5.0, ... qmax=0.1, tol=4e-3) >>> chi2mix.estimate_chi2mixture(x) >>> pv = chi2mix.sf(x) >>> print(pv[:4]) [0.2 0.2 0.2 0.2] >>> >>> print('%.2f' % chi2mix.scale) 1.98 >>> print('%.2f' % chi2mix.dof) 0.89 >>> print('%.2f' % chi2mix.mixture) 0.20
Ground truth evaluation¶
-
limix.stats.
confusion_matrix
(df, wsize=50000)[source] Provide a couple of scores based on the idea of windows around genetic markers.
Parameters: - causals – Indices defining the causal markers.
- pos – Within-chromossome base-pair position of each candidate marker, in crescent order.
Kinship¶
-
limix.stats.
linear_kinship
(G, out=None, verbose=True)[source] Estimate Kinship matrix via linear kernel.
Examples
>>> from numpy.random import RandomState >>> from numpy import array_str >>> from limix.stats import linear_kinship >>> >>> random = RandomState(1) >>> X = random.randn(4, 100) >>> K = linear_kinship(X, verbose=False) >>> print(array_str(K, precision=4)) [[ 0.9131 -0.1928 -0.3413 -0.379 ] [-0.1928 0.8989 -0.2356 -0.4704] [-0.3413 -0.2356 0.9578 -0.3808] [-0.379 -0.4704 -0.3808 1.2302]]
Likelihood-ratio test¶
-
limix.stats.
effsizes_se
(effsizes, pvalues)[source] Standard errors of the effect sizes.
Parameters: - effsizes (array_like) – Effect sizes.
- pvalues (array_like) – Association significance corresponding to those effect sizes.
Returns: Standard errors of the effect sizes.
Return type: array_like
-
limix.stats.
lrt_pvalues
(null_lml, alt_lmls, dof=1)[source] Compute p-values from likelihood ratios.
These are likelihood ratio test p-values.
Parameters: Returns: P-values.
Return type: array_like
-
limix.stats.
pca
(X, ncomp)[source] Principal component analysis.
Parameters: - X (array_like) – Data.
- ncomp (int) – Number of components.
Returns: - components (array_like): first components ordered by explained variance.
- explained_variance (array_like): explained variance.
- explained_variance_ratio (array_like): percentage of variance explained.
Return type: Examples
>>> from numpy import round >>> from numpy.random import RandomState >>> from limix.stats import pca >>> >>> X = RandomState(1).randn(4, 5) >>> r = pca(X, ncomp=2) >>> round(r['components'], 2) array([[-0.75, 0.58, -0.08, 0.2 , -0.23], [ 0.49, 0.72, 0.02, -0.46, -0.16]]) >>> round(r['explained_variance'], 4) array([ 6.4466, 0.5145]) >>> round(r['explained_variance_ratio'], 4) array([ 0.9205, 0.0735])
-
limix.stats.
multipletests
(pvals, alpha=0.05, method='hs', is_sorted=False)[source] Test results and p-value correction for multiple tests.
Parameters: - pvals (array_like) – Uncorrected p-values.
- alpha (float) – FWER, family-wise error rate, e.g.
0.1
. - method (string) –
Method used for testing and adjustment of pvalues. Can be either the full name or initial letters. Available methods are
`bonferroni` : one-step correction `sidak` : one-step correction `holm-sidak` : step down method using Sidak adjustments `holm` : step-down method using Bonferroni adjustments `simes-hochberg` : step-up method (independent) `hommel` : closed method based on Simes tests (non-negative) `fdr_bh` : Benjamini/Hochberg (non-negative) `fdr_by` : Benjamini/Yekutieli (negative) `fdr_tsbh` : two stage fdr correction (non-negative) `fdr_tsbky` : two stage fdr correction (non-negative)
- is_sorted (bool) – If
False
(default), the p_values will be sorted, but the corrected pvalues are in the original order. IfTrue
, then it assumed that the pvalues are already sorted in ascending order.
Returns: - reject (array_like, boolean) –
true
for hypothesis that can be rejected for given alpha. - pvals_corrected (array_like) – P-values corrected for multiple tests.
- alphacSidak (float) – corrected alpha for Sidak method.
- alphacBonf (float) – corrected alpha for Bonferroni method.
Notes
This is a wrapper around a function from the statsmodels package.
-
limix.stats.
empirical_pvalues
(xt, x0)[source] Function to compute empirical p-values.
Compute empirical p-values from the test statistics observed on the data and the null test statistics (from permutations, parametric bootstraps, etc).
Parameters: - xt (array_like) – Test statistcs observed on data.
- x0 (array_like) – Null test statistcs. The minimum p-value that can be
estimated is
1./float(len(x0))
.
Returns: Estimated empirical p-values.
Return type: array_like
Examples
>>> from numpy.random import RandomState >>> from limix.stats import empirical_pvalues >>> >>> random = RandomState(1) >>> x0 = random.chisquare(1, 5) >>> x1 = random.chisquare(1, 10000) >>> >>> empirical_pvalues(x0, x1) array([0.563 , 1. , 0.839 , 0.7982, 0.5803])
-
class
limix.stats.
Chi2Mixture
(scale_min=0.1, scale_max=5.0, dof_min=0.1, dof_max=5.0, n_intervals=100, qmax=0.1, tol=0)[source] A class for continuous random variable following a chi2 mixture.
Class for evaluation of P values for a test statistic that follows a two-component mixture of chi2
\[(1-\pi)\chi^2(0) + \pi a \chi^2(d).\]Here \(\pi\) is the probability being in the first component and \(a\) and \(d\) are the scale parameter and the number of degrees of freedom of the second component.
Parameters: - scale_min (float) – Minimum value used for fitting the scale parameter.
- scale_max (float) – Maximum value used for fitting the scale parameter.
- dofmin (float) – Minimum value used for fitting the dof parameter.
- dofmax (float) – Maximum value used for fitting the dof parameter.
- qmax (float) – Only the top qmax quantile is used for the fit.
- n_interval (int) – Number of intervals when performing gridsearch.
- tol (float) – Tolerance of being zero.
Examples
>>> from numpy.random import RandomState >>> import scipy as sp >>> from limix.stats import Chi2Mixture >>> >>> scale = 0.3 >>> dof = 2 >>> mixture = 0.2 >>> n = 100 >>> >>> random = RandomState(1) >>> x = random.chisquare(dof, n) >>> n0 = int( (1-mixture) * n) >>> idxs = random.choice(n, n0, replace=False) >>> x[idxs] = 0 >>> >>> chi2mix = Chi2Mixture(scale_min=0.1, scale_max=5.0, ... dof_min=0.1, dof_max=5.0, ... qmax=0.1, tol=4e-3) >>> chi2mix.estimate_chi2mixture(x) >>> pv = chi2mix.sf(x) >>> print(pv[:4]) [0.2 0.2 0.2 0.2] >>> >>> print('%.2f' % chi2mix.scale) 1.98 >>> print('%.2f' % chi2mix.dof) 0.89 >>> print('%.2f' % chi2mix.mixture) 0.20
-
estimate_chi2mixture
(lrt)[source] Estimates the parameters of a chi2 mixture.
Estimates the parameters of a chi2 mixture by fitting the empirical distribution of null test statistic.
Parameters: lrt (array_like) – Null test statistcs.
-
sf
(lrt)[source] Computes the p-values from test statistics lrt.
Parameters: lrt (array_like) – Test statistics. Returns: P-values. Return type: array_like
-
limix.stats.
linear_kinship
(G, out=None, verbose=True)[source] Estimate Kinship matrix via linear kernel.
Examples
>>> from numpy.random import RandomState >>> from numpy import array_str >>> from limix.stats import linear_kinship >>> >>> random = RandomState(1) >>> X = random.randn(4, 100) >>> K = linear_kinship(X, verbose=False) >>> print(array_str(K, precision=4)) [[ 0.9131 -0.1928 -0.3413 -0.379 ] [-0.1928 0.8989 -0.2356 -0.4704] [-0.3413 -0.2356 0.9578 -0.3808] [-0.379 -0.4704 -0.3808 1.2302]]
-
limix.stats.
lrt_pvalues
(null_lml, alt_lmls, dof=1)[source] Compute p-values from likelihood ratios.
These are likelihood ratio test p-values.
Parameters: Returns: P-values.
Return type: array_like
-
limix.stats.
effsizes_se
(effsizes, pvalues)[source] Standard errors of the effect sizes.
Parameters: - effsizes (array_like) – Effect sizes.
- pvalues (array_like) – Association significance corresponding to those effect sizes.
Returns: Standard errors of the effect sizes.
Return type: array_like
-
limix.stats.
confusion_matrix
(df, wsize=50000)[source] Provide a couple of scores based on the idea of windows around genetic markers.
Parameters: - causals – Indices defining the causal markers.
- pos – Within-chromossome base-pair position of each candidate marker, in crescent order.
-
limix.stats.
allele_frequency
(expec)[source] Compute allele frequency from its expectation.
Parameters: expec (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix. Returns: Allele frequencies encoded as a variants-by-alleles matrix. Return type: numpy.ndarray
-
limix.stats.
compute_dosage
(expec, alt=None)[source] Compute dosage from allele expectation.
Parameters: - expec (array_like) – Allele expectations encoded as a variants-by-samples-by-alleles matrix.
- ref (array_like) – Allele reference of each locus. The allele having the minor allele frequency
for the provided
expec
is used as the reference if None. Defaults to None.
Returns: Dosage encoded as a variants-by-samples matrix.
Return type: Examples
>>> from bgen_reader import read_bgen, allele_expectation, example_files >>> from bgen_reader import compute_dosage >>> >>> with example_files("example.32bits.bgen") as filepath: ... bgen = read_bgen(filepath, verbose=False) ... e = allele_expectation(bgen["genotype"], nalleles=2, ploidy=2) ... dosage = compute_dosage(e).compute() ... print(dosage.shape) ... print(dosage) (199, 500) [[ nan 1.93575854 1.91558579 ... 1.94351192 0.10894776 1.01101689] [1.98779296 1.97802735 0.02111815 ... 1.95492412 1.00897216 1.02255316] [0.01550294 0.99383543 1.97933958 ... 1.98681641 1.99041748 1.99603272] ... [1.99319479 1.980896 1.98767124 ... 1.9943846 1.99716186 1.98712159] [0.01263467 0.09661863 0.00869752 ... 0.00643921 0.00494384 0.01504517] [0.99185182 1.94860838 0.99734497 ... 0.02914425 1.97827146 0.9515991 ]]
-
limix.stats.
allele_expectation
(p, nalleles, ploidy)[source] Allele expectation.
Compute the expectation of each allele from the given probabilities. It accepts three shapes of matrices: - unidimensional array of probabilities; - bidimensional samples-by-alleles probabilities array; - and three dimensional variants-by-samples-by-alleles array.
Parameters: Returns: Last dimension will contain the expectation of each allele.
Return type: Examples
>>> from texttable import Texttable >>> from bgen_reader import read_bgen, allele_expectation, example_files >>> >>> sampleid = "sample_005" >>> rsid = "RSID_6" >>> >>> with example_files("example.32bits.bgen") as filepath: ... bgen = read_bgen(filepath, verbose=False) ... ... locus = bgen["variants"].query("rsid == '{}'".format(rsid)).index ... sample = bgen["samples"].query("id == '{}'".format(sampleid)).index ... ... nalleles = bgen["variants"].loc[locus, "nalleles"].item() ... ploidy = 2 ... ... p = bgen["genotype"][locus[0], sample[0]].compute() ... # For unphased genotypes only. ... e = allele_expectation(bgen["genotype"][locus[0], sample[0]], nalleles, ploidy) ... ... alleles = bgen["variants"].loc[locus, "allele_ids"].item().split(",") ... ... tab = Texttable() ... ... tab.add_rows( ... [ ... ["", "AA", "AG", "GG", "E[.]"], ... ["p"] + list(p) + [1.0], ... ["#" + alleles[0], 2, 1, 0, e[0]], ... ["#" + alleles[1], 0, 1, 2, e[1]], ... ] ... ) >>> print(tab.draw()) +----+-------+-------+-------+-------+ | | AA | AG | GG | E[.] | +====+=======+=======+=======+=======+ | p | 0.012 | 0.987 | 0.001 | 1 | +----+-------+-------+-------+-------+ | #A | 2 | 1 | 0 | 1.011 | +----+-------+-------+-------+-------+ | #G | 0 | 1 | 2 | 0.989 | +----+-------+-------+-------+-------+ >>> print("variant: {}".format(rsid)) variant: RSID_6 >>> print("sample : {}".format(sampleid)) sample : sample_005
Note
This function supports unphased genotypes only.
Tutorials¶
eQTL¶
This tutorial illustrates the use of limix to analyse expression datasets. We consider gene expression levels from a yeast genetics study with freely available data. This data set span 109 individuals with 2,956 marker SNPs and expression levels for 5,493 in glucose and ethanol growth media respectively. It is based on the eQTL basics tutorial of limix 1.0, which is now deprecated.
Importing limix¶
Downloading data¶
We are going to use a HDF5 file containg phenotype and genotyped data from a remote repository. Limix provides some handy utilities to perform common command line tasks, like as downloading and extracting files. However, feel free to use whatever method you prefer.
Selecting gene YBR115C under the glucose condition¶
Query for a specific phenotype, select the phenotype itself, and plot it.
The glucose condition is given by the environment 0
.
header = data[‘phenotype’][‘col_header’] query = “gene_ID==’YBR115C’ and environment==0” idx = header.query(query).i.values y = data[‘phenotype’][‘matrix’][:, idx].ravel()

Univariate association test with linear mixed model¶
You have the option to either pass a raw array of samples-by-candidates for the association scan or pass a tabular structure naming those candidates. We recommend the second option as it will help maintain the association between the results and the corresponding candidates.
The naming of those candidates is defined here by concatenating the chromossome name and base-pair position. However, it is often the case that SNP IDs are provided along with the data, which can naturally be used for naming those candidates.

As you can see, we now have a pandas data frame G
that keeps the candidate
identifications together with the actual allele read.
This data frame can be readily used to perform association scan.

Inspecting the p-values and effect-sizes are now easier because candidate names are kept together with their corresponding statistics.

A Manhattan plot can help understand the result.

We then remove the temporary files.

Variance decomposition¶
Limix enables flexible fitting of variance component models. Here, we illustrate the usage of variance component models fit to single genes. It is based on the eQTL basics tutorial of limix 1.0, which is now deprecated.
The Genetic Model¶
The model considered by the LIMIX variance decomposition module is an extension of the genetic model employed in standard GWAS, however considers multiple random effects terms:
where
If \(\mathbf{R}\) is a genetic contribution from set of SNPs \(\mathcal{S}\), with a bit of abuse of notation we can write \(\mathbf{R}= \frac{1}{C}\mathbf{X}_{:,\,\mathcal{S}}{\mathbf{X}_{:,\,\mathcal{S}}}^T\) where \(C=\frac{1}{N}\text{trace}\left(\mathbf{X}_{:,\,\mathcal{S}_i}{\mathbf{X}_{:,\,\mathcal{S}_i}}^T\right)\).
Limix supports an arbitrary number of fixed or random effects to be included in the model.
Example: cis/trans Variance Decomposition in the glucose condition¶
Here we use the LIMIX variance decomposition module to quantify the variability in gene expression explained by proximal (cis) and distal (trans) genetic variation. To do so, we build a linear mixed model with a fixed effect intercept, two random effects for cis and trans genetic effects and a noise random effect:
where \(\mathbf{R}^\text{(cis)}\) and \(\mathbf{R}^\text{(trans)}\) are the local and distal relatedeness matrices, built considering all SNPs in cis and trans (i.e., not in cis) respectively. As cis region is defined by the 50kb region around each gene.
The gene-model is fitted to gene expression in environment 0 for all genes in the Lysine Biosynthesis pathway and variance components are averaged thereafter to obtain pathway based variance components.

We then remove the temporary files.

API reference¶
I/O module¶
limix.io.bgen.read (filepath[, size, …]) |
Read a given BGEN file. |
limix.io.bimbam.read_phenotype (filepath) |
Read a BIMBAM phenotype file. |
limix.io.bimbam.see_phenotype (filepath) |
Shows a summary of a BIMBAM phenotype file. |
limix.io.csv.read (filename[, sep, header]) |
Read a CSV file. |
limix.io.csv.see (filepath, header[, verbose]) |
Shows a human-friendly representation of a CSV file. |
limix.io.gen.read (prefix) |
|
limix.io.hdf5.fetch (fp, path) |
Fetches an array from hdf5 file. |
limix.io.hdf5.fetcher (filename) |
Fetch datasets from HDF5 files. |
limix.io.hdf5.read_limix (filepath) |
Read the HDF5 limix file format. |
limix.io.hdf5.see (f_or_filepath[, …]) |
Shows a human-friendly tree representation of the contents of a hdf5 file. |
limix.io.npy.read (filepath[, verbose]) |
|
limix.io.npy.save (filepath, X[, verbose]) |
|
limix.io.npy.see (filepath[, verbose]) |
|
limix.io.plink.read (prefix[, verbose]) |
Read PLINK files into Pandas data frames. |
limix.io.plink.see_bed (filepath, verbose) |
|
limix.io.plink.see_kinship (filepath, verbose) |
|
limix.io.plink.fetch_dosage (prefix, verbose) |
Quality control¶
limix.qc.boxcox (x) |
Box Cox transformation for normality conformance. |
limix.qc.compute_maf (X) |
Compute minor allele frequencies. |
limix.qc.count_missingness (X) |
Count the number of missing values per column. |
limix.qc.indep_pairwise (X, window_size, …) |
Determine pair-wise independent variants. |
limix.qc.mean_impute (X) |
Column-wise impute NaN values by column mean. |
limix.qc.mean_standardize (X[, axis, out]) |
Zero-mean and one-deviation normalisation. |
limix.qc.normalise_covariance (K[, out]) |
Variance rescaling of covariance matrix K . |
limix.qc.quantile_gaussianize (x) |
Normalize a sequence of values via rank and Normal c.d.f. |
limix.qc.regress_out (Y, X[, return_b]) |
Regresses out X from Y |
limix.qc.remove_dependent_cols (X[, tol, verbose]) |
Remove dependent columns. |
limix.qc.unique_variants (X) |
Filters out variants with the same genetic profile. |
Statistics¶
limix.stats.allele_expectation (p, nalleles, …) |
Allele expectation. |
limix.stats.allele_frequency (expec) |
Compute allele frequency from its expectation. |
limix.stats.Chi2Mixture ([scale_min, …]) |
A class for continuous random variable following a chi2 mixture. |
limix.stats.compute_dosage (expec[, alt]) |
Compute dosage from allele expectation. |
limix.stats.confusion_matrix (df[, wsize]) |
Provide a couple of scores based on the idea of windows around genetic markers. |
limix.stats.effsizes_se (effsizes, pvalues) |
Standard errors of the effect sizes. |
limix.stats.empirical_pvalues (xt, x0) |
Function to compute empirical p-values. |
limix.stats.linear_kinship (G[, out, verbose]) |
Estimate Kinship matrix via linear kernel. |
limix.stats.lrt_pvalues (null_lml, alt_lmls) |
Compute p-values from likelihood ratios. |
limix.stats.multipletests (pvals[, alpha, …]) |
Test results and p-value correction for multiple tests. |
limix.stats.pca (X, ncomp) |
Principal component analysis. |
Heritability estimation¶
limix.her.estimate (y, lik, K[, M, verbose]) |
Estimate the so-called narrow-sense heritability. |
Quantitative trait loci¶
limix.qtl.scan (G, y, lik[, K, M, verbose]) |
Single-variant association testing via generalised linear mixed models. |
limix.qtl.QTLModel (null_lml, alt_lmls, …) |
Result of a QTL analysis. |
Plotting & Graphics¶
limix.plot.box_aspect ([ax]) |
Change to box aspect considering the plotted points. |
limix.plot.ConsensusCurve () |
Consolidate multiple curves in a single one. |
limix.plot.image (file[, ax]) |
Show an image. |
limix.plot.kinship (K[, nclusters, img_kws, ax]) |
Plot heatmap of a kinship matrix. |
limix.plot.load_dataset (name) |
Example datasets. |
limix.plot.manhattan (data[, colora, colorb, …]) |
Produce a manhattan plot. |
limix.plot.normal (x[, bins, nstd, ax]) |
Plot a fit of a normal distribution to the data in x. |
limix.plot.pca (X[, pts_kws, ax]) |
Plot the first two principal components of a design matrix. |
limix.plot.power (pv[, label, alphas, …]) |
Plot number of hits across significance levels. |
limix.plot.qqplot (a[, label, alpha, cutoff, …]) |
Quantile-Quantile plot of observed p-values versus theoretical ones. |
limix.plot.image (file[, ax]) |
Show an image. |
limix.plot.get_pyplot () |
|
limix.plot.show () |
Generalised Linear Mixed Models¶
limix.glmm.GLMMComposer.covariance_matrices |
Get the covariance matrices. |
limix.glmm.GLMMComposer.decomp () |
Get the fixed and random effects. |
limix.glmm.GLMMComposer.fit ([verbose]) |
Fit the model. |
limix.glmm.GLMMComposer.fixed_effects |
Get the fixed effects. |
limix.glmm.GLMMComposer.likname |
Get likelihood name. |
limix.glmm.GLMMComposer.lml () |
Get the log of the marginal likelihood. |
limix.glmm.GLMMComposer.y |
Get the outcome array. |
Shell utilities¶
limix.sh.filehash (filepath) |
Compute sha256 from a given file. |
limix.sh.download (url[, dest, verbose, force]) |
|
limix.sh.extract (filepath[, verbose]) |
|
limix.sh.remove (filepath) |
Developers¶
Data type¶
-
limix._dataset.
_normalise_dataset
(y, M=None, G=None, K=None)[source]¶ Convert data types to DataArray.
This is a fundamental function for
limix
as it standardise outcome, covariates, candidates, and kinship arrays intoxarray.DataArray
data type. Data arrays arenumpy
/dask
arrays with indexed coordinates, therefore generalising data frames frompandas
. It allows for lazy loading of data via dask arrays. It also supports arrays with different dimensionality and types, mixture of indexed and non-indexed arrays, and repeated sample labels.Examples
>>> from __future__ import unicode_literals >>> import pytest >>> from numpy.random import RandomState >>> from pandas import DataFrame >>> from xarray import DataArray >>> from limix._dataset import _normalise_dataset >>> >>> random = RandomState(0) >>> >>> y = random.randn(4) >>> y = DataFrame(y, index=["sample0", "sample0", "sample1", "sample2"]) >>> >>> G = random.randn(5, 6) >>> >>> data = _normalise_dataset(y, G=G) >>> print(data["y"]) <xarray.DataArray 'outcome' (sample: 4, trait: 1)> array([[1.764052], [0.400157], [0.978738], [2.240893]]) Coordinates: * sample (sample) object 'sample0' 'sample0' 'sample1' 'sample2' * trait (trait) int64 0 >>> print(data["G"]) <xarray.DataArray 'candidates' (sample: 4, candidate: 6)> array([[ 1.867558, -0.977278, 0.950088, -0.151357, -0.103219, 0.410599], [ 0.144044, 1.454274, 0.761038, 0.121675, 0.443863, 0.333674], [ 1.494079, -0.205158, 0.313068, -0.854096, -2.55299 , 0.653619], [ 0.864436, -0.742165, 2.269755, -1.454366, 0.045759, -0.187184]]) Coordinates: * sample (sample) object 'sample0' 'sample0' 'sample1' 'sample2' Dimensions without coordinates: candidate >>> K = random.randn(3, 3) >>> K = K.dot(K.T) >>> K = DataArray(K) >>> K.coords["dim_0"] = ["sample0", "sample1", "sample2"] >>> K.coords["dim_1"] = ["sample0", "sample1", "sample2"] >>> >>> data = _normalise_dataset(y, K=K) >>> print(data["y"]) <xarray.DataArray 'outcome' (sample: 4, trait: 1)> array([[1.764052], [0.400157], [0.978738], [2.240893]]) Coordinates: * sample (sample) object 'sample0' 'sample0' 'sample1' 'sample2' * trait (trait) int64 0 >>> print(data["K"]) <xarray.DataArray 'variance-covariance' (sample_0: 4, sample_1: 4)> array([[ 1.659103, 1.659103, -0.850801, -1.956422], [ 1.659103, 1.659103, -0.850801, -1.956422], [-0.850801, -0.850801, 1.687126, -0.194938], [-1.956422, -1.956422, -0.194938, 6.027272]]) Coordinates: * sample_0 (sample_0) <U7 'sample0' 'sample0' 'sample1' 'sample2' * sample_1 (sample_1) <U7 'sample0' 'sample0' 'sample1' 'sample2' >>> with pytest.raises(ValueError): ... _normalise_dataset(y, G=G, K=K)
Comments and bugs¶
You can get the source and open issues on Github.