Docs PyPI

EpiScanpy – Epigenomics single cell analysis in python

EpiScanpy is a toolkit to analyse single-cell open chromatin (scATAC-seq) and single-cell DNA methylation (for example scBS-seq) data. EpiScanpy is the epigenomic extension of the very popular scRNA-seq analysis tool Scanpy (Genome Biology, 2018) [Wolf18]. For more information, read scanpy documentation.

The documentation for epiScanpy is available here. If epiScanpy is useful to your research, consider citing epiScanpy.

Report issues and access the code on GitHub.

Note

Also see the release notes of scanpy.

Also see the release notes of anndata.

Version 0.1.0 May 10, 2019

Initial release.

Tutorials


Single cell ATAC-seq

To get started, we recommend epiScanpy’s analysis pipeline for scATAC-seq data for 10k PBMCs from 10x Genomics. This tutorial focuses on preprocessing, clustering, identification of cell types via known marker genes and trajectory inference. The tutorial can be found here.

NB: The current tutorial is a beta version that do not include either optimal low embedding & clustering settings or proper cell type identifications. To check out performance see other tutorials below and an updated version will be available very soon.

_images/pbmc_PCA_coverage.png _images/pbmc_umap.png _images/pbmc_diffmap.png

If you want to see how to build count matrices from ATAC-seq bam files for different set of annotations (like enhancers). The tutorial can be found here.

Soon available, there will be a tutorial on preprocessing, clustering and identification of cell types for the Cusanovich mouse scATAC-seq atlas [Cusanovich18] (prefrontal cortex data). In this tutorial we focus on the use of different feature space count matrices (peak and enhancer based count matrices).

_images/umap.png _images/umap_ATACseq_Astrocyte_marker.png _images/heatmap.png

An additional tutorial on processing and clustering count matrices from the Cusanovich atlas. Here.


Single cell DNA methylation

Here you can find a tutorial for the preprocessing, clustering and identification of cell types for single-cell DNA methylation data using the publicly available data from Luo et al. [Luo17].

The first tutorial shows how to build the count matrices for the different feature spaces (windows, promoters) in different cytosine contexts. Here is the tutorial.

Then, there is a second tutorial on how to use them and compare the results. The data used comes from mouse brain (frontal cortex). It will be available very soon.

_images/umap_markers_hodology_ecker.png _images/umapexcitatory_neurons_promoters.png _images/umapSatb2_CLUSTER_NORM.png _images/umapenhancer_CG_Luoetal.png

Usage Principles

Import the epiScanpy API as:

import episcanpy.api as epi
import anndata as ad

Workflow

The first step is to build the count matrix. Because single-cell epigenomic data types have different characteristics (count data in ATAC-seq versus methylation level in DNA methylation, for example), epiScanpy implements -omic specific approaches to build the count matrix. All the functions to build the count matrices (for ATAC, methylation or other) will use epi.ct (ct = count).

The first step is to load an annotation and then build the count matrix that will be either methylation or ATAC-seq specific. For example using epi.ct, e.g.:

epi.ct.load_features(file_features, **tool_params)  # to load annotation files
epi.ct.build_count_mtx(cell_file_names, omic="ATAC") # to build the ATAC-seq count matrix

If you have an already build matrix, you can load it with any additional metadata (such as cell annotations or batches).

The count matrix, either the one that has been constructed or uploaded, with any additional informations (such as cell annotations or batches) are stored as an AnnData object. All functions for quality control and preprocessing are called using epi.pp (pp = preprocessing).

To visualise how common features are and what is the coverage distribution of the count matrix features, use:

epi.pp.commoness_features(adata, **processing_params)
epi.pp.coverage_cells(adata, **processing_params)

The next step, is the calculation of tSNE, UMAP, PCA etc. For that, we take advantage of the embedding into Scanpy and we use mostly Scanpy functions, which are called using sc.tl (tl = tool) [Wolf18]. For that, see Scanpy usage principles: <https://scanpy.readthedocs.io/en/latest/basic_usage.html>`__. For example, to obtain cell-cell distance calculations or low dimensional representation we make use of the adata object, and store n_obs observations (cells) of n_vars variables (expression, methylation, chromatin features). For each tool, there typically is an associated plotting function in sc.tl and sc.pl (pl = plot)

sc.tl.tsne(adata, **tool_params)
sc.pl.tsne(adata, **plotting_params)

There are also epiScanpy specific tools and plotting functions that can be accessed using epi.tl and epi.pl

epi.tl.silhouette(adata, **tool_params)
epi.pl.silhouette(adata, **plotting_params)
epi.pl.prct_overlap(adata, **plotting_params)

Data structure

Similarly to Scanpy, the methylation and ATAC-seq matrices are stored as anndata object. For more information on the datastructure see here`here <https://anndata.readthedocs.io/en/latest/>`__

System Requirements

Hardware requirements

epiScanpy package requires only a standard computer with enough RAM to support the in-memory operations.

Software requirements

### OS Requirements This package is supported for macOS and Linux. The package has been tested on the following systems: + macOS: Mojave (10.14.1)

Python Dependencies

EpiScanpy require a working version of Python (>= 3.5)

Additionally, this package epiScanpy depends on other Python dependencies and packages.:

anndata
matplotlib
numpy
pandas
pyliftOver
pysam
scanpy
scipy
scikit-learn
seaborn

Installation

Anaconda

If you do not have a working Python 3.5 or 3.6 installation, consider installing Miniconda (see Installing Miniconda). Then run:

conda install seaborn scikit-learn statsmodels numba
conda install -c conda-forge python-igraph louvain
conda create -n scanpy python=3.6 scanpy

Finally, run:

conda install -c annadanese episcanpy

Pull epiScanpy from PyPI (consider using pip3 to access Python 3):

pip install episcanpy

Github

you can also install epiScanpy directly from Github:

pip install git+https://github.com/colomemaria/epiScanpy

API

Import epiScanpy’s high-level API as:

import episcanpy.api as epi

Count Matrices: CT

Loading data, loading annotations, building count matrices, filtering of lowly covered methylation variables. Filtering of lowly covered cells.

Load features

In order to build a count matrix for either methylation or open chromatin data, loading the segmentation of the genome of interest or the set of features of interest is a prerequirement.

ct.load_features(file_features[, …])

The function load features is here to transform a bed file into a usable set of units to measure methylation levels.

ct.make_windows(size[, chromosomes, max_length])

Generate windows/bins of the given size for the appropriate genome (default choice is human).

ct.size_feature_norm(loaded_feature, size)

If the features loaded are too smalls or of different sizes, it is possible to normalise them to a unique given size by extending the feature coordinate in both directions.

ct.plot_size_features(loaded_feature[, …])

Plot the different feature sizes in an histogram.

ct.name_features(loaded_features)

Extract the names of the loaded features, specifying the chromosome they originated from.

Reading methylation file

Functions to read methylation files, extract methylation and buildthe count matrices:

ct.build_count_mtx(cells, annotation[, …])

Build methylation count matrix for a given annotation.

ct.read_cyt_summary(sample_name, meth_type, …)

Read file from which you want to extract the methylation level and (assuming it is like the Ecker/Methylpy format) extract the number of methylated read and the total number of read for the cytosines covered and in the right genomic context (CG or CH) :param sample_name: name of the file to read to extract key information.

ct.load_met_noimput(matrix_file[, path, save])

read the raw count matrix and convert it into an AnnData object.

Reading open chromatin(ATAC) file

ATAC-seq specific functions to build count matrices and load data:

ct.bld_atac_mtx(list_bam_files, loaded_feat)

Build a count matrix one set of features at a time.

ct.save_sparse_mtx(initial_matrix[, …])

Convert regular atac matrix into a sparse Anndata:

General functions

Functions non -omic specific:

ct.save_sparse_mtx(initial_matrix[, …])

Convert regular atac matrix into a sparse Anndata:

Preprocessing: PP

Imputing missing data (methylation), filtering lowly covered cells or variables, correction for batch effect.

pp.coverage_cells(adata[, bins, key_added, …])

Histogram of the number of open features (in the case of ATAC-seq data) per cell.

pp.commoness_features

pp.select_var_feature(adata[, max_score, …])

This function computes a variability score to rank the most variable features across all cells.

pp.binarize(adata[, copy])

convert the count matrix into a binary matrix.

pp.lazy(adata[, pp_pca, nb_pcs, …])

Automatically computes PCA coordinates, loadings and variance decomposition, a neighborhood graph of observations, t-distributed stochastic neighborhood embedding (tSNE) Uniform Manifold Approximation and Projection (UMAP)

pp.load_metadata(adata, metadata_file[, …])

Load observational metadata in adata.obs.

pp.read_ATAC_10x(matrix[, cell_names, …])

Load sparse matrix (including matrices corresponding to 10x data) as AnnData objects.

pp.filter_cells(adata[, min_counts, …])

Filter cell outliers based on counts and numbers of genes expressed.

pp.filter_features(data[, min_counts, …])

Filter features based on number of cells or counts.

pp.normalize_total(adata[, target_sum, …])

Normalize counts per cell.

pp.pca(adata[, n_comps, zero_center, …])

Principal component analysis [Pedregosa11].

pp.normalize_per_cell(adata[, …])

Normalize total counts per cell.

pp.regress_out(adata, keys[, n_jobs, copy])

Regress out unwanted sources of variation.

pp.subsample(data[, fraction, n_obs, …])

Subsample to a fraction of the number of observations.

pp.downsample_counts(adata[, …])

Downsample counts from count matrix.

pp.neighbors(adata[, n_neighbors, n_pcs, …])

Compute a neighborhood graph of observations [McInnes18].

pp.sparse(adata[, copy])

Transform adata.X from a matrix or array to a csc sparse matrix.

Methylation matrices

Methylation specific count matrices.

pp.imputation_met(adata[, …])

Impute missing values in methyaltion level matrices.

pp.load_met_noimput(matrix_file[, path, save])

read the raw count matrix and convert it into an AnnData object.

pp.readandimputematrix(file_name[, min_coverage])

Temporary function to load and impute methyaltion count matrix into an AnnData object

Tools: TL

tl.rank_features(adata, groupby[, omic, …])

It is a wrap-up function of scanpy sc.tl.rank_genes_groups function.

tl.silhouette(adata_name, cluster_annot[, …])

Compute silhouette scores.

tl.lazy(adata[, pp_pca, copy])

Automatically computes PCA coordinates, loadings and variance decomposition, a neighborhood graph of observations, t-distributed stochastic neighborhood embedding (tSNE) Uniform Manifold Approximation and Projection (UMAP)

tl.load_markers(path, marker_list_file)

Convert list of known cell type markers from literature to a dictionary Input list of known marker genes First row is considered the header

tl.identify_cluster(adata, cell_type, …[, …])

Use markers of a given cell type to plot peak openness for peaks in promoters of the given markers Input cell type, cell type markers, peak promoter intersections

tl.top_feature_genes(adata, gtf_file[, …])

tl.find_genes

tl.diffmap(adata[, n_comps, copy])

Diffusion Maps [Coifman05] [Haghverdi15] [Wolf18].

tl.draw_graph(adata[, layout, init_pos, …])

Force-directed graph drawing [Islam11] [Jacomy14] [Chippada18].

tl.tsne(adata[, n_pcs, use_rep, perplexity, …])

t-SNE [Maaten08] [Amir13] [Pedregosa11].

tl.umap(adata[, min_dist, spread, …])

Embed the neighborhood graph using UMAP [McInnes18].

tl.louvain(adata[, resolution, …])

Cluster cells into subgroups [Blondel08] [Levine15] [Traag17].

tl.leiden(adata[, resolution, restrict_to, …])

Cluster cells into subgroups [Traag18].

Plotting: PL

The plotting module episcanpy.plotting largely parallels the tl.* and a few of the pp.* functions. For most tools and for some preprocessing functions, you’ll find a plotting function with the same name.

pl.pca(adata[, color, feature_symbols, …])

Scatter plot in PCA coordinates.

pl.pca_floadings

pl.pca_overview(adata)

Plot PCA results.

pl.pca_variance_ratio(adata[, n_pcs, log, …])

Plot the variance ratio.

pl.tsne(adata[, color, feature_symbols, …])

Scatter plot in tSNE basis.

pl.umap(adata[, color, feature_symbols, …])

Scatter plot in UMAP basis.

pl.diffmappl.draw_graph

pl.rank_feat_groups(adata[, groups, …])

Plot ranking of features.

pl.rank_feat_groups_violin(adata[, groups, …])

Plot ranking of features for all tested comparisons.

pl.rank_feat_groups_dotplot(adata[, groups, …])

Plot ranking of features using dotplot plot (see dotplot())

pl.rank_feat_groups_stacked_violin(adata[, …])

Plot ranking of features using stacked_violin plot (see stacked_violin())

pl.rank_feat_groups_matrixplot(adata[, …])

Plot ranking of features using matrixplot plot (see matrixplot())

pl.rank_feat_groups_heatmap(adata[, groups, …])

Plot ranking of features using heatmap plot (see heatmap())

pl.rank_feat_groups_tracksplot(adata[, …])

Plot ranking of features using heatmap plot (see heatmap())

pl.cal_var(adata[, show])

pl.violin(adata, keys[, groupby, log, …])

Violin plot.

pl.scatter(adata[, x, y, color, use_raw, …])

Scatter plot along observations or variables axes.

pl.ranking(adata, attr, keys[, dictionary, …])

Plot rankings.

pl.clustermap(adata[, obs_keys, use_raw, …])

Hierarchically-clustered heatmap.

pl.stacked_violin(adata, var_names[, …])

Stacked violin plots.

pl.heatmap(adata, var_names[, groupby, …])

Heatmap of the expression values of genes.

pl.dotplot(adata, var_names[, groupby, …])

Makes a dot plot of the expression values of var_names.

pl.matrixplot(adata, var_names[, groupby, …])

Creates a heatmap of the mean expression values per cluster of each var_names If groupby is not given, the matrixplot assumes that all data belongs to a single category.

pl.tracksplot(adata, var_names, groupby[, …])

In this type of plot each var_name is plotted as a filled line plot where the y values correspond to the var_name values and x is each of the cells.

pl.dendrogram(adata, groupby[, …])

Plots a dendrogram of the categories defined in groupby.

pl.correlation_matrix(adata, groupby[, …])

Plots the correlation matrix computed as part of sc.tl.dendrogram.

pl.prct_overlap(adata, key_1, key_2[, norm, …])

% or cell count corresponding to the overlap of different cell types between 2 set of annotations/clusters.

pl.overlap_heatmap(adata, key_1, key_2[, …])

Heatmap of the cluster correspondance between 2 set of annaotations.

pl.cluster_composition(adata, cluster, condition)

pl.silhouette(adata_name, cluster_annot[, …])

Plot the product of tl.silhouette as a silhouette plot

pl.silhouette_tot(adata_name, cluster_annot)

Both compute silhouette scores and plot it.

References

Angerer16

Angerer et al. (2016), destiny – diffusion maps for large-scale single-cell data in R, Bioinformatics.

Cusanovich18

Cusanovich, D. A. et al. A Single-Cell Atlas of In Vivo Mammalian Chromatin Accessibility. Cell 174, 1309–1324.e18 (2018).

Luo17

Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).

Wolf18

Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).