PicturedRocks: Single Cell RNA-seq Analysis Tool

PicturedRocks is a tool for the analysis of single cell RNA-seq data. Currently, we implement two marker selection approaches:

  • a 1-bit compressed sensing based sparse SVM algorithm, and
  • a mutual information-based greedy feature selection algorithm.

Installing

Please ensure you have Python 3.6 or newer and have numba and scikit-learn installed. The best way to get Python and various dependencies is with Anaconda or Miniconda. Once you have a conda environment, run conda install numba scikit-learn. Then use pip to install PicturedRocks and all additional dependencies:

pip install picturedrocks

To install the latest code from GitHub, clone our GitHub repository. Once inside the project directory, install it by running pip install -e . (the dot is part of the command). The -e option installs via a symbolic link to the current directory instead of copying files onto your system.

Reading data

In addition to the functions for reading input data provided by scanpy, PicturedRocks provides methods for loading and processing cluster labels, which many of its functions require.

picturedrocks.read.process_clusts(adata, name='clust', copy=False)

Process cluster labels from an obs column

This copies adata.obs[name] into adata.obs["clust"] and precomputes cluster indices, the number of clusters, etc. for use by various functions in PicturedRocks.

Parameters:
  • adata (anndata.AnnData) – the AnnData object to process
  • copy (bool) – determines whether a copy of AnnData object is returned
Returns:

object with annotation

Return type:

anndata.AnnData

Notes

The information computed here is lost when saving as a .loom file. If a .loom file has cluster information, you should run this function immediately after sc.read_loom.
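For illustration, the kind of bookkeeping this precomputes can be sketched with plain pandas (a hypothetical simplification; the real function stores its results on the AnnData object itself):

```python
import pandas as pd

# Toy cluster labels standing in for adata.obs[name]
labels = pd.Series(["B", "A", "B", "C", "A"])

# Map labels to 0-indexed integer codes and count clusters
codes, uniques = pd.factorize(labels)
num_clusts = len(uniques)

# Precompute the cell indices belonging to each cluster
clusterindices = {k: (codes == k).nonzero()[0] for k in range(num_clusts)}

print(num_clusts)                        # 3
print(clusterindices[0].tolist())        # [0, 2]: cells in the first cluster
```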

picturedrocks.read.read_clusts(adata, filename, sep=', ', name='clust', header=True, copy=False)

Read cluster labels from a csv into an obs column

Parameters:
  • adata (anndata.AnnData) – the AnnData object to read labels into
  • filename (str) – filename of the csv file with labels
  • sep (str, optional) – csv delimiter
  • name (str, optional) – destination for label is adata.obs[name]
  • header (bool) – determines whether the csv has a header line. If False, it is assumed that data begins on the first line of the csv
  • copy (bool) – determines whether a copy of AnnData object is returned
Returns:

object with cluster labels

Return type:

anndata.AnnData

Notes

  • Cluster ids will automatically be changed so they are 0-indexed
  • csv can either be two columns (in which case the first column is treated as observation label and merging handled by pandas) or one column (only cluster labels, ordered as in adata)
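As an illustration of the two layouts, here is roughly what happens with plain pandas (a hypothetical simplification of read_clusts; the real function merges the labels into adata.obs):

```python
import io
import pandas as pd

# Two-column csv: first column is the observation label (merging handled by pandas)
two_col = io.StringIO("cell,clust\ncell0,B\ncell1,A\ncell2,B\n")
df = pd.read_csv(two_col, index_col=0)

# One-column csv: cluster labels only, ordered as in adata
one_col = io.StringIO("clust\nB\nA\nB\n")
labels = pd.read_csv(one_col)["clust"]

# Cluster ids are automatically re-coded to be 0-indexed
codes, _ = pd.factorize(labels)
print(codes.tolist())  # [0, 1, 0]
```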

Preprocessing

The preprocessing module provides basic preprocessing tools. To avoid reinventing the wheel, we won’t repeat methods already in scanpy unless we need functionality not available there.

picturedrocks.preprocessing.pca(data, dim=3, center=True, copy=False)

Runs PCA

Parameters:
  • data (anndata.AnnData) – input data
  • dim (int, optional) – number of PCs to compute
  • center (bool, optional) – determines whether to center data before running PCA
  • copy – determines whether a copy of AnnData object is returned
Returns:

object with obsm["X_pca"], varm["PCs"], and uns["num_pcs"] set

Return type:

anndata.AnnData
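The computation itself can be sketched in plain numpy (an illustration of centered PCA via SVD, not the library's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))      # toy (n_obs, n_vars) expression matrix
dim = 3

Xc = X - X.mean(axis=0)             # center=True subtracts the per-gene mean
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
PCs = Vt[:dim].T                    # loadings, analogous to varm["PCs"]
X_pca = Xc @ PCs                    # coordinates, analogous to obsm["X_pca"]

print(X_pca.shape)  # (100, 3)
```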

Plotting

picturedrocks.plot.genericplot(celldata, coords, **scatterkwargs)

Generate a figure for some embedding of data

This function supports both 2D and 3D plots. This may be used to plot data for any embedding (e.g., PCA or t-SNE). For example usage, see code for pcafigure.

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • coords (numpy.ndarray) – (N, 2) or (N, 3) shaped coordinates of the embedded data
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.genericwrongplot(celldata, coords, yhat, labels=None, **scatterkwargs)

Plot figure with incorrectly classified points highlighted

This can be used with any 2D or 3D embedding (e.g., PCA or t-SNE). For example code, see pcawrongplot.

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • coords (numpy.ndarray) – (N, 2) or (N, 3) shaped array with coordinates to plot
  • yhat (numpy.ndarray) – (N, 1) shaped array of predicted y values
  • labels (list, optional) – list of axis titles
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.pcafigure(celldata, **scatterkwargs)

Make a 3D PCA figure for an AnnData object

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.pcawrongplot(celldata, yhat, **scatterkwargs)

Generate a 3D PCA figure with incorrectly classified points highlighted

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • yhat (numpy.ndarray) – (N, 1) shaped array of predicted y values
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.plotgeneheat(celldata, coords, genes, hide_clusts=False, **scatterkwargs)

Generate gene heat plot for some embedding of AnnData

This generates a figure with multiple dropdown options. The first option is "Clust" for a plot similar to genericplot, and the remaining options correspond to the genes specified in genes. When celldata.genes is defined, these dropdowns are labeled with the gene names.

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • coords (numpy.ndarray) – (N, 2) or (N, 3) shaped coordinates of the embedded data (e.g., PCA or tSNE)
  • genes (list) – list of gene indices or gene names
  • hide_clusts (bool) – Determines if cluster labels are ignored even if they are available

Selecting Markers

PicturedRocks currently implements two categories of marker selection algorithms:
  • mutual information-based algorithms
  • 1-bit compressed sensing based algorithms

Feature Selection Tutorial

In this Jupyter notebook, we’ll walk through the information-theoretic feature selection algorithms in PicturedRocks.

In [1]:
import numpy as np
import scanpy.api as sc
import picturedrocks as pr
In [2]:
adata = sc.datasets.paul15()
WARNING: In Scanpy 0.*, this returned logarithmized data. Now it returns non-logarithmized data.
... storing 'paul15_clusters' as categorical
In [3]:
adata
Out[3]:
AnnData object with n_obs × n_vars = 2730 × 3451
    obs: 'paul15_clusters'
    uns: 'iroot'

The process_clusts method copies the cluster column and precomputes various indices, etc. If you have multiple columns that can be used as target labels (e.g., different treatments, clusters via different clustering algorithms or parameters, or demographics), this sets and processes the given column as the one we're currently examining.

This is necessary for supervised analysis and visualization tools in PicturedRocks that use cluster labels.

In [4]:
pr.read.process_clusts(adata, "paul15_clusters")
Out[4]:
AnnData object with n_obs × n_vars = 2730 × 3451
    obs: 'paul15_clusters', 'clust', 'y'
    uns: 'iroot', 'num_clusts', 'clusterindices'

Normalize per cell and log transform the data

In [5]:
sc.pp.normalize_per_cell(adata)
In [6]:
sc.pp.log1p(adata)

The makeinfoset method creates an InformationSet object with the following discretization: \([\mathrm{round}(\log_2(X_{ij} + 1))]_{ij}\), where \(X\) is the data matrix. It is useful to have only a small number of discrete states that each gene can take so that entropy is a reasonable measurement.

In [7]:
infoset = pr.markers.makeinfoset(adata, True)

Alternatively, we could have made our own InformationSet object with our own discretization. Here we re-create infoset by hand. Note: At this time, InformationSet cannot handle scipy’s sparse matrices.

In [8]:
X_disc = np.log2(adata.X + 1).round().astype(int)
In [9]:
infoset2 = pr.markers.mutualinformation.infoset.InformationSet(np.c_[X_disc, adata.obs['y']], True)
In [10]:
np.array_equal(infoset2.X, infoset.X)
Out[10]:
True

Because this dataset only has 3451 features, it is computationally easy to do feature selection without restricting the number of features. If we wanted to, we could do either supervised or unsupervised univariate feature selection (i.e., without considering any interactions between features).

In [11]:
# supervised
mim = pr.markers.mutualinformation.iterative.MIM(infoset)
most_relevant_genes = mim.autoselect(1000)
In [12]:
# unsupervised
ue = pr.markers.mutualinformation.iterative.UniEntropy(infoset)
most_variable_genes = ue.autoselect(1000)

At this stage, if we wanted to, we could slice our adata object as adata[:,most_relevant_genes] or adata[:,most_variable_genes] and create a new InformationSet object for this sliced object. We don’t need to do that here.

Supervised Feature Selection

Let’s jump straight into supervised feature selection. Here we will use the CIFE objective

In [13]:
cife = pr.markers.mutualinformation.iterative.CIFE(infoset)
In [14]:
cife.score
Out[14]:
array([0.06153028, 0.03873652, 0.10514698, ..., 0.13305092, 0.17305014,
       0.03877704])
In [15]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Ctsg', 'Prtn3', 'Car2', 'Elane', 'Car1', 'H2afy', 'Ermap', 'Mt2',
       'Fam132a', 'Klf1'],
      dtype='object')

Let’s select ‘Car1’

In [16]:
ind = adata.var_names.get_loc('Car1')
In [17]:
cife.add(ind)

Now, the top genes are

In [18]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Elane', 'Ctsg', 'Prtn3', 'Apoe', 'Fam132a', 'P4hb', 'Gstm1', 'H2afy',
       'Alas1', 'Car2'],
      dtype='object')

Observe that the order has changed based on redundancy (or lack thereof) with ‘Car1’. Let’s add ‘Ctsg’

In [19]:
ind = adata.var_names.get_loc('Ctsg')
cife.add(ind)
In [20]:
top_genes = np.argsort(cife.score)[::-1]
print(adata.var_names[top_genes[:10]])
Index(['Apoe', 'Mt1', 'Mcm5', 'Ran', 'Srm', 'Uqcrq', 'Psma7', 'Hspd1', 'Uqcr',
       'Cox5a'],
      dtype='object')

If we want to select the top gene at each step, we can use autoselect

In [21]:
cife.autoselect(10)

To look at the markers we’ve selected, we can examine cife.S

In [22]:
cife.S
Out[22]:
[552, 815, 305, 1953, 3002, 2666, 2479, 596, 2645, 785, 412, 1503]
In [23]:
adata.var_names[cife.S]
Out[23]:
Index(['Car1', 'Ctsg', 'Apoe', 'Mt1', 'Taldo1', 'S100a10', 'Ptprcap', 'Ccnd2',
       'Rps8', 'Crip1', 'Atpase6', 'Hspa8'],
      dtype='object')

This process can also be done manually with a user interface.

In [24]:
im = pr.markers.interactive.InteractiveMarkerSelection(adata, cife, dim_red="umap")
Running umap on genes...
/home/uvarma/miniconda3/envs/scpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:

invalid value encountered in sqrt

Running umap on cells...
/home/uvarma/miniconda3/envs/scpr/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning:

invalid value encountered in sqrt

In [25]:
im.show()

Note that because we passed the same cife object, any genes added/removed in the interface will affect the cife object.

In [26]:
adata.var_names[cife.S]
Out[26]:
Index(['Car1', 'Ctsg', 'Apoe', 'Mt1', 'Taldo1', 'S100a10', 'Ptprcap', 'Ccnd2',
       'Rps8', 'Crip1', 'Atpase6', 'Hspa8'],
      dtype='object')

Unsupervised Feature Selection

This works very similarly. In the example below, we’ll autoselect 5 genes and then run the interface. Note that although the previous section would not work without cluster labels, the following code will.

In [27]:
cife_unsup = pr.markers.mutualinformation.iterative.CIFEUnsup(infoset)
In [28]:
cife_unsup.autoselect(5)

(If you ran the example above, this will load faster because the UMAP coordinates for genes and cells have already been computed. You can also disable one or both of these plots with keyword arguments, e.g., InteractiveMarkerSelection(..., show_genes=False).)

In [29]:
im_unsup = pr.markers.interactive.InteractiveMarkerSelection(adata, cife_unsup, show_genes=False, dim_red="umap")
In [30]:
im_unsup.show()

Mutual information

TODO: Explanation of how these work goes here.

Before running any mutual information based algorithms, we need a discretized version of the gene expression matrix with a limited number of discrete values (because we do not make any assumptions about the distribution of gene expression). Such data is stored in a picturedrocks.markers.InformationSet object; by default, we suggest using picturedrocks.markers.makeinfoset() to generate such an object after appropriate normalization.

Iterative Feature Selection

All information-theoretic feature selection methods in PicturedRocks are greedy algorithms. In general, they implement the abstract class IterativeFeatureSelection. See Supervised Feature Selection and Unsupervised Feature Selection for specific algorithms.

class picturedrocks.markers.mutualinformation.iterative.IterativeFeatureSelection(infoset)

Abstract Class for Iterative Feature Selection

add(ind)

Select specified feature

Parameters:ind (int) – Index of feature to select
autoselect(n_feats)

Auto select features

This automatically selects n_feats features greedily by selecting the feature with the highest score at each iteration.

Parameters:n_feats (int) – The number of features to select
remove(ind)

Remove specified feature

Parameters:ind (int) – Index of feature to remove
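Conceptually, autoselect is a greedy argmax loop over score. The toy selector below sketches the interface (a simplification; the real classes recompute score after every add to account for redundancy with the selected set):

```python
import numpy as np

class ToySelector:
    """Toy stand-in for an IterativeFeatureSelection subclass (illustration only)."""
    def __init__(self, base_score):
        self.score = np.asarray(base_score, dtype=float)
        self.S = []                      # indices of selected features

    def add(self, ind):
        self.S.append(ind)
        self.score[ind] = -np.inf        # never pick the same feature twice

    def autoselect(self, n_feats):
        # Greedily pick the highest-scoring feature n_feats times
        for _ in range(n_feats):
            self.add(int(np.argmax(self.score)))

sel = ToySelector([0.2, 0.9, 0.5, 0.7])
sel.autoselect(2)
print(sel.S)  # [1, 3]
```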
Supervised Feature Selection
class picturedrocks.markers.mutualinformation.iterative.MIM(infoset)
class picturedrocks.markers.mutualinformation.iterative.CIFE(infoset)
class picturedrocks.markers.mutualinformation.iterative.JMI(infoset)
Unsupervised Feature Selection
class picturedrocks.markers.mutualinformation.iterative.UniEntropy(infoset)
class picturedrocks.markers.mutualinformation.iterative.CIFEUnsup(infoset)

Auxiliary Classes and Methods

class picturedrocks.markers.InformationSet(X, has_y=False)

Stores discrete gene expression matrix

Parameters:
  • X (numpy.ndarray) – a (num_obs, num_vars) shape array with dtype int
  • has_y (bool) – whether the array X has a target label column (a y column) as its last column
picturedrocks.markers.makeinfoset(adata, include_y)

Discretize data

Parameters:
  • adata (anndata.AnnData) – The data to discretize. By default data is discretized as round(log2(X + 1)).
  • include_y (bool) – Determines if the y (cluster label) column in included in the InformationSet object
Returns:

An object that can be used to perform information theoretic calculations.

Return type:

picturedrocks.markers.mutualinformation.infoset.InformationSet

Interactive Marker Selection

class picturedrocks.markers.interactive.InteractiveMarkerSelection(adata, feature_selection, disp_genes=10, connected=True, show_cells=True, show_genes=True, dim_red='tsne')

Run an interactive marker selection GUI inside a jupyter notebook

Parameters:
  • adata (anndata.AnnData) – The data to run marker selection on. If you want to restrict to a small number of genes, slice your anndata object.
  • feature_selection (picturedrocks.markers.mutualinformation.iterative.IterativeFeatureSelection) – An instance of an iterative feature selection algorithm class that corresponds to adata (i.e., the column indices in feature_selection should correspond to the column indices in adata)
  • disp_genes (int) – Number of genes to display as options (by default, the number of genes plotted on the tSNE plot is 3 * disp_genes, but this can be changed by setting the plot_genes property after initializing)
  • connected (bool) – Parameter to pass to plotly.offline.init_notebook_mode. If your browser does not have internet access, you should set this to False.
  • show_cells (bool) – Determines whether to display a tSNE plot of the cells with a drop-down menu to look at gene expression levels for candidate genes.
  • show_genes (bool) – Determines whether to display a tSNE plot of genes to visualize gene similarity
  • dim_red ({"tsne", "umap"}) – Dimensionality reduction algorithm

Warning

This class requires modules not explicitly listed as dependencies of picturedrocks. Specifically, please ensure that you have ipywidgets installed and that you use this class only inside a jupyter notebook.

picturedrocks.markers.interactive.cife_obj(H, i, S)

The CIFE objective function for feature selection

Parameters:
  • H (function) – an entropy function, typically the bound method H on an instance of InformationSet. For example, if infoset is of type picturedrocks.markers.InformationSet, then pass infoset.H
  • i (int) – index of candidate gene
  • S (list) – list of features already selected
Returns:

the candidate feature’s score relative to the selected gene set S

Return type:

float
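To make the objective concrete, the sketch below reimplements a CIFE-style score from a generic joint-entropy function, scoring candidate i as I(x_i; y) minus the sum over j in S of [I(x_i; x_j) - I(x_i; x_j | y)]. This is an illustration with a toy entropy function, not the library's cife_obj (whose H comes from an InformationSet instance):

```python
import numpy as np

def make_H(data):
    """Return a joint-entropy function (in bits) over columns of a discrete matrix."""
    def H(cols):
        _, counts = np.unique(data[:, cols], axis=0, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())
    return H

def cife_score(H, i, S, y_col):
    """CIFE-style objective: relevance minus redundancy plus conditional redundancy."""
    score = H([i]) + H([y_col]) - H([i, y_col])        # I(x_i; y)
    for j in S:
        mi = H([i]) + H([j]) - H([i, j])               # I(x_i; x_j)
        cmi = (H([i, y_col]) + H([j, y_col])
               - H([i, j, y_col]) - H([y_col]))        # I(x_i; x_j | y)
        score -= mi - cmi
    return score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
noise = rng.integers(0, 2, size=500)
data = np.c_[y, noise, y]   # column 0 mirrors y, column 1 is noise, column 2 is y

H = make_H(data)
# The informative feature scores higher than the noise feature
print(cife_score(H, 0, [], y_col=2) > cife_score(H, 1, [], y_col=2))  # True
```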

Measuring Feature Selection Performance

This module can be used to evaluate feature selection methods via K-fold cross validation.

class picturedrocks.performance.FoldTester(adata)

Performs K-fold Cross Validation for Marker Selection

FoldTester can be used to evaluate various marker selection algorithms. It can split the data in K folds, run marker selection algorithms on these folds, and classify data based on testing and training data.

Parameters:adata (anndata.AnnData) – data to slice into folds
classify(classifier)

Classify each cell using training data from other folds

For each fold, we project the data onto the markers selected for that fold; the fold itself is treated as test data and its complement as training data.

Parameters:classifier – a classifier that trains with a training data set and predicts labels of test data. See NearestCentroidClassifier for an example.
loadfolds(file)

Load folds from a file

The file can be one saved either by FoldTester.savefolds() or FoldTester.savefoldsandmarkers(). In the latter case, it will not load any markers.

loadfoldsandmarkers(file)

Load folds and markers

Loads a folds and markers file saved by FoldTester.savefoldsandmarkers()

Parameters:file (str) – filename to load from (typically with a .npz extension)
makefolds(k=5, random=False)

Makes folds

Parameters:
  • k (int) – the value of K
  • random (bool) – If True, makefolds makes folds randomly. Otherwise, the folds are made in order (i.e., the first ceil(N / k) cells in the first fold, etc.)
savefolds(file)

Save folds to a file

Parameters:file (str) – filename to save (typically with a .npz extension)
savefoldsandmarkers(file)

Save folds and markers for each fold

This saves folds, and for each fold, the markers previously found by FoldTester.selectmarkers().

Parameters:file (str) – filename to save to (typically with a .npz extension)
selectmarkers(select_function)

Perform a marker selection algorithm on each fold

Parameters:select_function (function) – a function that takes in an AnnData object and outputs a list of gene markers, given by their index
validatefolds()

Ensure that all observations are in exactly one fold

Returns:whether every observation is in exactly one fold
Return type:bool
class picturedrocks.performance.NearestCentroidClassifier

Nearest Centroid Classifier for Cross Validation

Computes the centroid of each cluster label in the training data, then predicts the label of each test data point by finding the nearest centroid.

test(Xtest)
train(adata)
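The rule itself is simple enough to sketch in plain numpy (an illustration of nearest-centroid classification, not the class's actual code):

```python
import numpy as np

def train_centroids(X, y):
    """Compute one centroid per cluster label in the training data."""
    labels = np.unique(y)
    return labels, np.stack([X[y == c].mean(axis=0) for c in labels])

def predict(Xtest, labels, centroids):
    """Assign each test point the label of its nearest centroid."""
    d = np.linalg.norm(Xtest[:, None, :] - centroids[None, :, :], axis=2)
    return labels[d.argmin(axis=1)]

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
labels, cents = train_centroids(X, y)
print(predict(np.array([[0.2, 0.1], [4.9, 5.2]]), labels, cents))  # [0 1]
```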
class picturedrocks.performance.PerformanceReport(y, yhat)

Report actual vs predicted statistics

Parameters:
  • y (numpy.ndarray) – actual cluster labels, (N, 1)-shaped numpy array
  • yhat (numpy.ndarray) – predicted cluster labels, (N, 1)-shaped numpy array
confusionmatrixfigure()

Compute and make a confusion matrix figure

Returns:confusion matrix
Return type:plotly figure
getconfusionmatrix()

Get the confusion matrix for the latest run

Returns:array of shape (K, K), with the [i, j] entry being the fraction of cells in cluster i that were predicted to be in cluster j
Return type:numpy.ndarray
printscore()

Print a message with the score

show()

Print a full report

This uses iplot, so we assume this will only be run in a Jupyter notebook and that init_notebook_mode has already been run.

wrong()

Returns the number of cells misclassified.
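The confusion matrix described above can be computed in a few lines of numpy (a sketch of the same statistic, not the class's code):

```python
import numpy as np

def confusion(y, yhat, k):
    """Row-normalized confusion matrix: entry [i, j] is the fraction of
    cluster-i cells that were predicted to be in cluster j."""
    m = np.zeros((k, k))
    np.add.at(m, (y, yhat), 1)            # count each (actual, predicted) pair
    return m / m.sum(axis=1, keepdims=True)

y = np.array([0, 0, 0, 1, 1])             # actual labels
yhat = np.array([0, 0, 1, 1, 1])          # predicted labels
m = confusion(y, yhat, 2)
print(m)  # row 0: [2/3, 1/3]; row 1: [0, 1]
```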

picturedrocks.performance.kfoldindices(n, k, random=False)

Generate indices for k-fold cross validation

Parameters:
  • n (int) – number of observations
  • k (int) – number of folds
  • random (bool) – determines whether to randomize the order
Yields:

numpy.ndarray – array of indices in each fold
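A minimal sketch of this fold-index generation in plain numpy (illustrative; np.array_split reproduces the in-order behavior described, with the first folds getting ceil(n / k) indices):

```python
import numpy as np

def kfold_indices(n, k, random=False):
    """Yield an array of observation indices for each of k folds."""
    idx = np.arange(n)
    if random:
        np.random.shuffle(idx)            # randomize the order before splitting
    # In-order split: the first folds each get ceil(n / k) indices
    yield from np.array_split(idx, k)

folds = list(kfold_indices(10, 3))
print([len(f) for f in folds])  # [4, 3, 3]
```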
