PicturedRocks–Single Cell RNA-seq Analysis Tool

PicturedRocks is a tool for the analysis of single cell RNA-seq data. Currently, we implement two marker selection approaches:

  • a 1-bit compressed sensing based sparse SVM algorithm, and
  • a mutual information-based greedy feature selection algorithm.

Installing

Please ensure you have Python 3.6 or newer and have numba and scikit-learn installed. The best way to get Python and various dependencies is with Anaconda or Miniconda. Once you have a conda environment, run conda install numba scikit-learn. Then use pip to install PicturedRocks and all additional dependencies:

pip install picturedrocks

To install the latest code from GitHub, clone our GitHub repository. Once inside the project directory, install it by running pip install -e . (note the trailing dot, which refers to the current directory). The -e option will point a symbolic link to the current directory instead of installing a copy on your system.

Reading data

scanpy already provides various functions for reading input data. In addition, many methods in picturedrocks need cluster labels, which the functions below read and annotate.

picturedrocks.read.process_clusts(adata, copy=False)

Annotate with information about clusters

Precomputes cluster indices, number of clusters, etc.

Parameters:
  • adata (anndata.AnnData) – the AnnData object to annotate
  • copy (bool) – determines whether a copy of AnnData object is returned
Returns:

object with annotation

Return type:

anndata.AnnData

Notes

The information computed here is lost when saving as a .loom file. If a .loom file has cluster information, you should run this function immediately after sc.read_loom.

picturedrocks.read.read_clusts(adata, filename, sep=', ', copy=False)

Read cluster labels from a csv

Parameters:
  • adata (anndata.AnnData) – the AnnData object to read labels into
  • filename (str) – filename of the csv file with labels
  • sep (str, optional) – csv delimiter
  • copy (bool) – determines whether a copy of AnnData object is returned
Returns:

object with cluster labels

Return type:

anndata.AnnData

Notes

  • Cluster ids will automatically be changed so they are 0-indexed
  • csv can either be two columns (in which case the first column is treated as observation label and merging handled by pandas) or one column (only cluster labels, ordered as in adata)
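
The 0-indexing mentioned in the notes can be pictured with the short sketch below. It is an illustration only and not the code that read_clusts runs: the helper name zero_index_labels is hypothetical, and relabeling by sorted unique id is an assumption (the library may, for instance, simply subtract the smallest id).

```python
import numpy as np


def zero_index_labels(labels):
    """Map arbitrary integer cluster ids onto 0, 1, ..., K-1 (hypothetical helper)."""
    labels = np.asarray(labels)
    # np.unique returns the sorted distinct ids; `inverse` holds, for each
    # observation, the position of its id in that sorted list.
    unique_ids, inverse = np.unique(labels, return_inverse=True)
    return inverse.reshape(labels.shape), unique_ids


new_labels, original_ids = zero_index_labels([1, 3, 3, 2, 1])
print(new_labels)    # 0-indexed labels
print(original_ids)  # the ids they replaced
```

Keeping the original ids around makes it possible to translate predictions back to the labels used in the csv.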

Preprocessing

The preprocessing module provides basic preprocessing tools. To avoid reinventing the wheel, we won’t repeat methods already in scanpy unless we need functionality not available there.

picturedrocks.preprocessing.pca(data, dim=3, center=True, copy=False)

Runs PCA

Parameters:
  • data (anndata.AnnData) – input data
  • dim (int, optional) – number of PCs to compute
  • center (bool, optional) – determines whether to center data before running PCA
  • copy (bool) – determines whether a copy of AnnData object is returned
Returns:

object with obsm["X_pca"], varm["PCs"], and uns["num_pcs"] set

Return type:

anndata.AnnData
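
As a rough picture of what the dim and center parameters control, here is a minimal PCA sketch built on numpy's SVD. It works on a plain array rather than an AnnData object, the function name pca_sketch is hypothetical, and we do not claim that the library uses this particular solver.

```python
import numpy as np


def pca_sketch(X, dim=3, center=True):
    """Return (scores, components) for the top `dim` principal components."""
    X = np.asarray(X, dtype=float)
    if center:
        # Subtract each gene's mean so the PCs describe variance, not the mean.
        X = X - X.mean(axis=0)
    # Thin SVD: rows of Vt are principal directions, ordered by singular value.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:dim].T   # (num_vars, dim), analogous to varm["PCs"]
    scores = X @ components   # (num_obs, dim), analogous to obsm["X_pca"]
    return scores, components


rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(50, 8))  # synthetic counts, for illustration only
scores, components = pca_sketch(X, dim=3)
print(scores.shape, components.shape)
```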

Plotting

picturedrocks.plot.genericplot(celldata, coords, **scatterkwargs)

Generate a figure for some embedding of data

This function supports both 2D and 3D plots. This may be used to plot data for any embedding (e.g., PCA or t-SNE). For example usage, see code for pcafigure.

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • coords (numpy.ndarray) – (N, 2) or (N, 3) shaped coordinates of the embedded data
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.genericwrongplot(celldata, coords, yhat, labels=None, **scatterkwargs)

Plot figure with incorrectly classified points highlighted

This can be used with any 2D or 3D embedding (e.g., PCA or t-SNE). For example code, see pcawrongplot.

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • coords (numpy.ndarray) – (N, 2) or (N, 3) shaped array with coordinates to plot
  • yhat (numpy.ndarray) – (N, 1) shaped array of predicted y values
  • labels (list, optional) – list of axis titles
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.pcafigure(celldata, **scatterkwargs)

Make a 3D PCA figure for an AnnData object

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.pcawrongplot(celldata, yhat, **scatterkwargs)

Generate a 3D PCA figure with incorrectly classified points highlighted

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • yhat (numpy.ndarray) – (N, 1) shaped array of predicted y values
  • **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.plotgeneheat(celldata, coords, genes, hide_clusts=False, **scatterkwargs)

Generate gene heat plot for some embedding of AnnData

This generates a figure with multiple dropdown options. The first option is “Clust” for a plot similar to genericplot, and the remaining dropdown options correspond to the genes specified in genes. When celldata.genes is defined, these dropdown options are labeled with the gene names.

Parameters:
  • celldata (anndata.AnnData) – data to plot
  • coords (numpy.ndarray) – (N, 2) or (N, 3) shaped coordinates of the embedded data (e.g., PCA or tSNE)
  • genes (list) – list of gene indices or gene names
  • hide_clusts (bool) – Determines if cluster labels are ignored even if they are available

Selecting Markers

PicturedRocks currently implements two categories of marker selection algorithms:
  • mutual information-based algorithms
  • 1-bit compressed sensing based algorithms

Mutual information

TODO: Explanation of how these work goes here.

Before running any mutual information-based algorithm, we need a discretized version of the gene expression matrix with a limited number of discrete values (because we do not make any assumptions about the distribution of gene expression). Such data is stored in a picturedrocks.markers.InformationSet object. We suggest using picturedrocks.markers.makeinfoset() to generate such an object after appropriate normalization.

class picturedrocks.markers.mutualinformation.iterative.MIM(infoset)
class picturedrocks.markers.mutualinformation.iterative.CIFE(infoset)
class picturedrocks.markers.mutualinformation.iterative.JMI(infoset)
class picturedrocks.markers.mutualinformation.iterative.UniEntropy(infoset)
class picturedrocks.markers.mutualinformation.iterative.CIFEUnsup(infoset)

Auxiliary Classes and Methods

class picturedrocks.markers.InformationSet(X, has_y=False)

Stores discrete gene expression matrix

Parameters:
  • X (numpy.ndarray) – a (num_obs, num_vars) shape array with dtype int
  • has_y (bool) – whether the array X has a target label column (a y column) as its last column
picturedrocks.markers.makeinfoset(adata, include_y)

Discretize data

Parameters:
  • adata (anndata.AnnData) – The data to discretize. By default data is discretized as round(log2(X + 1)).
  • include_y (bool) – Determines if the y (cluster label) column is included in the InformationSet object
Returns:

An object that can be used to perform information theoretic calculations.

Return type:

picturedrocks.markers.mutualinformation.infoset.InformationSet
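
The default discretization named above, round(log2(X + 1)), can be reproduced on a plain array as follows. This is only a sketch of the formula; the helper name discretize is hypothetical, and makeinfoset itself operates on an AnnData object.

```python
import numpy as np


def discretize(X):
    """Discretize non-negative expression values as round(log2(X + 1))."""
    X = np.asarray(X, dtype=float)
    return np.round(np.log2(X + 1)).astype(int)


counts = np.array([[0, 1, 3], [7, 100, 1000]])
print(discretize(counts))
```

The logarithm compresses the dynamic range, so even counts in the thousands map to roughly a dozen distinct levels.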

Interactive Marker Selection

class picturedrocks.markers.interactive.InteractiveMarkerSelection(adata, feature_selection, disp_genes=10, connected=True, show_cells=True, show_genes=True)

Run an interactive marker selection GUI inside a jupyter notebook

Parameters:
  • adata (anndata.AnnData) – The data to run marker selection on. If you want to restrict to a small number of genes, slice your anndata object.
  • feature_selection (picturedrocks.markers.mutualinformation.iterative.IterativeFeatureSelection) – An instance of an iterative feature selection algorithm class that corresponds to adata (i.e., the column indices in feature_selection should correspond to the column indices in adata)
  • disp_genes (int) – Number of genes to display as options (by default, the number of genes plotted on the tSNE plot is 3 * disp_genes, but this can be changed by setting the plot_genes property after initializing).
  • connected (bool) – Parameter to pass to plotly.offline.init_notebook_mode. If your browser does not have internet access, you should set this to False.
  • show_cells (bool) – Determines whether to display a tSNE plot of the cells with a drop-down menu to look at gene expression levels for candidate genes.
  • show_genes (bool) – Determines whether to display a tSNE plot of genes to visualize gene similarity

Warning

This class requires modules not explicitly listed as dependencies of picturedrocks. Specifically, please ensure that you have ipywidgets installed and that you use this class only inside a jupyter notebook.

picturedrocks.markers.interactive.cife_obj(H, i, S)

The CIFE objective function for feature selection

Parameters:
  • H (function) – an entropy function, typically the bound method H on an instance of InformationSet. For example, if infoset is of type picturedrocks.markers.InformationSet, then pass infoset.H
  • i (int) – index of candidate gene
  • S (list) – list of features already selected
Returns:

the candidate feature’s score relative to the selected gene set S

Return type:

float
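
To make the roles of H, i, and S concrete, the sketch below writes out the CIFE criterion as it is usually stated in the literature: the relevance I(x_i; y) minus, for every selected gene j, the redundancy I(x_i; x_j) less the conditional redundancy I(x_i; x_j | y). Everything here is an assumption for illustration: make_entropy and cife_score are hypothetical names, the entropy function takes a list of column indices, and the label is taken to be the last column. The library's own H may have a different calling convention.

```python
import math
from collections import Counter


def make_entropy(rows):
    """Return H(cols): joint entropy (bits) of the given columns of `rows`."""
    n = len(rows)

    def H(cols):
        counts = Counter(tuple(row[c] for c in cols) for row in rows)
        return -sum((k / n) * math.log2(k / n) for k in counts.values())

    return H


def cife_score(H, i, S, y=-1):
    """CIFE score of candidate column i given already selected columns S."""
    # Relevance: I(x_i; y) = H(x_i) + H(y) - H(x_i, y)
    score = H([i]) + H([y]) - H([i, y])
    for j in S:
        # I(x_i; x_j)
        redundancy = H([i]) + H([j]) - H([i, j])
        # I(x_i; x_j | y)
        conditional = H([i, y]) + H([j, y]) - H([i, j, y]) - H([y])
        score -= redundancy - conditional
    return score


# Columns: x0, x1 (a copy of x0), x2, and the label y as the last column.
rows = [
    (0, 0, 0, 0),
    (0, 0, 1, 0),
    (1, 1, 0, 1),
    (1, 1, 1, 1),
]
H = make_entropy(rows)
print(cife_score(H, 0, []))   # x0 alone is fully informative about y
print(cife_score(H, 1, [0]))  # x1 adds nothing once x0 is selected
```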

Measuring Feature Selection Performance

This module can be used to evaluate feature selection methods via K-fold cross validation.

class picturedrocks.performance.FoldTester(adata)

Performs K-fold Cross Validation for Marker Selection

FoldTester can be used to evaluate various marker selection algorithms. It can split the data in K folds, run marker selection algorithms on these folds, and classify data based on testing and training data.

Parameters: adata (anndata.AnnData) – data to slice into folds
classify(classifier)

Classify each cell using training data from other folds

For each fold, we project the data onto the markers selected for that fold, which we treat as test data. We also project the complement of the fold and treat that as training data.

Parameters: classifier – a classifier that trains with a training data set and predicts labels of test data. See NearestCentroidClassifier for an example.
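
The procedure described for classify can be sketched on plain arrays as below. This is not the FoldTester implementation: classify_folds and one_nn_predict are hypothetical names, the classifier is passed as a simple function instead of an object, and the folds and markers are given by hand.

```python
import numpy as np


def one_nn_predict(Xtrain, ytrain, Xtest):
    """Toy 1-nearest-neighbor classifier, used only to make the sketch runnable."""
    dists = np.linalg.norm(Xtest[:, None, :] - Xtrain[None, :, :], axis=2)
    return ytrain[dists.argmin(axis=1)]


def classify_folds(X, y, folds, markers, predict):
    """Predict every cell's label using a model trained on the other folds."""
    yhat = np.empty_like(y)
    for fold, genes in zip(folds, markers):
        train = np.setdiff1d(np.arange(len(y)), fold)  # complement of the fold
        # Project both parts onto the markers selected for this fold.
        Xtrain, Xtest = X[np.ix_(train, genes)], X[np.ix_(fold, genes)]
        yhat[fold] = predict(Xtrain, y[train], Xtest)
    return yhat


X = np.array(
    [[0, 9, 0], [1, 8, 0], [0, 9, 1], [9, 0, 1], [8, 1, 0], [9, 0, 0]],
    dtype=float,
)
y = np.array([0, 0, 0, 1, 1, 1])
folds = [np.array([0, 3]), np.array([1, 4]), np.array([2, 5])]
markers = [[0, 1]] * 3  # pretend the same two genes were selected for every fold
yhat = classify_folds(X, y, folds, markers, one_nn_predict)
print(yhat)
```

Each cell is predicted exactly once, by a model that never saw it during training, which is what makes the resulting error rate a fair estimate.
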
loadfolds(file)

Load folds from a file

The file can be one saved either by FoldTester.savefolds() or FoldTester.savefoldsandmarkers(). In the latter case, it will not load any markers.

loadfoldsandmarkers(file)

Load folds and markers

Loads a folds and markers file saved by FoldTester.savefoldsandmarkers()

Parameters: file (str) – filename to load from (typically with a .npz extension)
makefolds(k=5, random=False)

Makes folds

Parameters:
  • k (int) – the value of K
  • random (bool) – If true, makefolds will make folds randomly. Otherwise, the folds are made in order (i.e., the first ceil(N / k) cells in the first fold, etc.)
savefolds(file)

Save folds to a file

Parameters: file (str) – filename to save to (typically with a .npz extension)
savefoldsandmarkers(file)

Save folds and markers for each fold

This saves folds, and for each fold, the markers previously found by FoldTester.selectmarkers().

Parameters: file (str) – filename to save to (typically with a .npz extension)
selectmarkers(select_function)

Perform a marker selection algorithm on each fold

Parameters: select_function (function) – a function that takes in an AnnData object and outputs a list of gene markers, given by their index
validatefolds()

Ensure that all observations are in exactly one fold

Returns:
Return type: bool
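
The check described for validatefolds amounts to counting how often each observation index occurs across the folds. A minimal sketch, assuming folds are given as a list of index arrays and using the hypothetical name folds_are_valid:

```python
import numpy as np


def folds_are_valid(folds, n):
    """Return True if each of the n observations appears in exactly one fold."""
    if len(folds) == 0:
        return n == 0
    all_idx = np.concatenate([np.asarray(f, dtype=int) for f in folds])
    # Indices outside 0..n-1 can never correspond to an observation.
    if all_idx.size and (all_idx.min() < 0 or all_idx.max() >= n):
        return False
    counts = np.bincount(all_idx, minlength=n)
    return bool(np.all(counts == 1))


print(folds_are_valid([[0, 1], [2, 3]], 4))  # a proper partition
print(folds_are_valid([[0, 1], [1, 2]], 3))  # index 1 appears twice
```
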
class picturedrocks.performance.NearestCentroidClassifier

Nearest Centroid Classifier for Cross Validation

Computes the centroid of each cluster label in the training data, then predicts the label of each test data point by finding the nearest centroid.

test(Xtest)
train(adata)
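
The nearest centroid rule described above fits in a few lines of numpy. The class below is a sketch and not the library class: its name is hypothetical, and its train method takes an expression array and a label array, whereas the documented train method takes an AnnData object.

```python
import numpy as np


class NearestCentroidSketch:
    """Minimal nearest centroid classifier (illustrative only)."""

    def train(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y).ravel()
        self.labels = np.unique(y)
        # One centroid per cluster: the mean expression profile of its cells.
        self.centroids = np.stack([X[y == k].mean(axis=0) for k in self.labels])

    def test(self, Xtest):
        Xtest = np.asarray(Xtest, dtype=float)
        # Pairwise Euclidean distances, shape (num_test, num_clusters).
        dists = np.linalg.norm(
            Xtest[:, None, :] - self.centroids[None, :, :], axis=2
        )
        return self.labels[dists.argmin(axis=1)]


clf = NearestCentroidSketch()
clf.train([[0, 0], [0, 2], [10, 10], [12, 10]], [0, 0, 1, 1])
print(clf.test([[1, 1], [9, 11]]))
```
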
class picturedrocks.performance.PerformanceReport(y, yhat)

Report actual vs predicted statistics

Parameters:
  • y (numpy.ndarray) – actual cluster labels, (N, 1)-shaped numpy array
  • yhat (numpy.ndarray) – predicted cluster labels, (N, 1)-shaped numpy array
confusionmatrixfigure()

Compute and make a confusion matrix figure

Returns: confusion matrix
Return type: plotly figure
getconfusionmatrix()

Get the confusion matrix for the latest run

Returns: array of shape (K, K), with the [i, j] entry being the fraction of cells in cluster i that were predicted to be in cluster j
Return type: numpy.ndarray
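
The row-normalized matrix described here can be computed from y and yhat as in the sketch below. The function name confusion_fractions and the explicit num_clusters argument are assumptions for illustration; getconfusionmatrix itself takes no arguments.

```python
import numpy as np


def confusion_fractions(y, yhat, num_clusters):
    """Entry [i, j] is the fraction of cells in cluster i predicted as cluster j."""
    y, yhat = np.asarray(y).ravel(), np.asarray(yhat).ravel()
    counts = np.zeros((num_clusters, num_clusters))
    # np.add.at accumulates correctly even when (y, yhat) pairs repeat.
    np.add.at(counts, (y, yhat), 1)
    row_totals = counts.sum(axis=1, keepdims=True)
    # Leave rows for empty clusters at zero instead of dividing by zero.
    return np.divide(
        counts, row_totals, out=np.zeros_like(counts), where=row_totals > 0
    )


y = np.array([0, 0, 0, 0, 1, 1])
yhat = np.array([0, 0, 0, 1, 1, 0])
cm = confusion_fractions(y, yhat, num_clusters=2)
print(cm)
```

Every row of a non-empty cluster sums to 1, so the diagonal can be read directly as the per-cluster accuracy.
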
printscore()

Print a message with the score

show()

Print a full report

This uses iplot, so we assume this will only be run in a Jupyter notebook and that init_notebook_mode has already been run.

wrong()

Returns the number of cells misclassified.

picturedrocks.performance.kfoldindices(n, k, random=False)

Generate indices for k-fold cross validation

Parameters:
  • n (int) – number of observations
  • k (int) – number of folds
  • random (bool) – determines whether to randomize the order
Yields:

numpy.ndarray – array of indices in each fold
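
A generator with this interface could look like the sketch below. The fold sizes follow the description given for makefolds (the first folds hold ceil(n / k) observations). The function name and the seed parameter are hypothetical, and the behavior when k does not divide n is an assumption.

```python
import math

import numpy as np


def kfold_indices_sketch(n, k, random=False, seed=None):
    """Yield k index arrays that together cover range(n)."""
    idx = np.arange(n)
    if random:
        # `seed` is a hypothetical extra parameter, added for reproducibility.
        idx = np.random.default_rng(seed).permutation(n)
    size = math.ceil(n / k)  # the first folds get ceil(n / k) observations
    for i in range(k):
        yield idx[i * size:(i + 1) * size]


folds = list(kfold_indices_sketch(10, 3))
print([f.tolist() for f in folds])
```

With random=True the same slicing is applied to a shuffled index array, so fold sizes stay the same and only membership changes.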
