PicturedRocks–Single Cell RNA-seq Analysis Tool¶
PicturedRocks is a tool for the analysis of single cell RNA-seq data. Currently, we implement two marker selection approaches:
- a 1-bit compressed sensing based sparse SVM algorithm, and
- a mutual information-based greedy feature selection algorithm.
Installing¶
Please ensure you have Python 3.6 or newer and have numba and scikit-learn installed. The best way to get Python and various dependencies is with Anaconda or Miniconda. Once you have a conda environment, run conda install numba scikit-learn. Then use pip to install PicturedRocks and all additional dependencies:
pip install picturedrocks
To install the latest code from GitHub, clone our GitHub repository. Once inside the project directory, run pip install -e . to install. The -e option will point a symbolic link to the current directory instead of installing a copy on your system.
Reading data¶
scanpy provides various functions for reading input data. In addition, various methods in picturedrocks need cluster labels, which the functions below add to an AnnData object.
picturedrocks.read.process_clusts(adata, copy=False)¶
Annotate with information about clusters
Precomputes cluster indices, number of clusters, etc.
Parameters:
- adata (anndata.AnnData) –
- copy (bool) – determines whether a copy of AnnData object is returned
Returns: object with annotation
Return type:
Notes
The information computed here is lost when saving as a .loom file. If a .loom file has cluster information, you should run this function immediately after sc.read_loom.
picturedrocks.read.read_clusts(adata, filename, sep=', ', copy=False)¶
Read cluster labels from a csv
Parameters:
- adata (anndata.AnnData) – the AnnData object to read labels into
- filename (str) – filename of the csv file with labels
- sep (str, optional) – csv delimiter
- copy (bool) – determines whether a copy of AnnData object is returned
Returns: object with cluster labels
Return type:
Notes
- Cluster ids will automatically be changed so they are 0-indexed
- csv can either be two columns (in which case the first column is treated as observation label and merging handled by pandas) or one column (only cluster labels, ordered as in adata)
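As a rough illustration of the notes above, the following sketch reads a one-column csv of cluster labels and shifts the ids so that the smallest one is 0. It uses only the standard library. The helper name read_labels_zero_indexed and the sample data are hypothetical, and shifting by the minimum id is an assumption about what "0-indexed" means here; the actual read_clusts implementation may differ.

```python
import csv
import io


def read_labels_zero_indexed(handle, sep=","):
    """Read one cluster label per row and shift ids so they start at 0.

    Hypothetical helper for illustration; not part of picturedrocks.
    """
    reader = csv.reader(handle, delimiter=sep)
    labels = [int(row[0]) for row in reader if row]
    offset = min(labels)
    return [label - offset for label in labels]


# One-column csv with 1-indexed cluster ids, ordered like the observations.
sample = io.StringIO("1\n3\n2\n1\n")
zero_indexed = read_labels_zero_indexed(sample)
print(zero_indexed)  # [0, 2, 1, 0]
```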
Preprocessing¶
The preprocessing module provides basic preprocessing tools. To avoid reinventing the wheel, we won’t repeat methods already in scanpy unless we need functionality not available there.
picturedrocks.preprocessing.pca(data, dim=3, center=True, copy=False)¶
Runs PCA
Parameters:
- data (anndata.AnnData) – input data
- dim (int, optional) – number of PCs to compute
- center (bool, optional) – determines whether to center data before running PCA
- copy – determines whether a copy of AnnData object is returned
Returns: object with obsm["X_pca"], varm["PCs"], and uns["num_pcs"] set
Return type:
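For readers unfamiliar with what such a function computes, here is a minimal PCA sketch using NumPy's SVD. It is not the picturedrocks implementation: the function name pca_sketch is hypothetical, and the correspondence of its two return values to obsm["X_pca"] and varm["PCs"] is an assumption based on their shapes.

```python
import numpy as np


def pca_sketch(X, dim=3, center=True):
    """Return (scores, components) for the top `dim` principal components.

    Hypothetical helper for illustration; not part of picturedrocks.
    """
    X = np.asarray(X, dtype=float)
    if center:
        X = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:dim].T      # shape (num_vars, dim)
    scores = X @ components      # shape (num_obs, dim)
    return scores, components


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
scores, components = pca_sketch(X, dim=3)
print(scores.shape, components.shape)  # (50, 3) (10, 3)
```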
Plotting¶
picturedrocks.plot.genericplot(celldata, coords, **scatterkwargs)¶
Generate a figure for some embedding of data
This function supports both 2D and 3D plots. This may be used to plot data for any embedding (e.g., PCA or t-SNE). For example usage, see code for pcafigure.
Parameters:
- celldata (anndata.AnnData) – data to plot
- coords (numpy.ndarray) – (N, 2) or (N, 3) shaped coordinates of the embedded data
- **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
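The phrase "dictionaries are merged recursively" can be illustrated with a small sketch. The function merge_recursive and the default values below are hypothetical and only show the general idea: a nested keyword such as marker overrides individual keys rather than replacing the whole nested dictionary. How picturedrocks performs the merge internally is not shown here.

```python
def merge_recursive(base, override):
    """Return a new dict with `override` merged into `base`, recursing into nested dicts.

    Hypothetical helper for illustration; not part of picturedrocks.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_recursive(merged[key], value)
        else:
            merged[key] = value
    return merged


# Made-up defaults; only the marker size is overridden by the caller.
defaults = {"mode": "markers", "marker": {"size": 3, "opacity": 0.8}}
user_kwargs = {"marker": {"size": 5}}
merged_kwargs = merge_recursive(defaults, user_kwargs)
print(merged_kwargs)
# {'mode': 'markers', 'marker': {'size': 5, 'opacity': 0.8}}
```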
picturedrocks.plot.genericwrongplot(celldata, coords, yhat, labels=None, **scatterkwargs)¶
Plot figure with incorrectly classified points highlighted
This can be used with any 2D or 3D embedding (e.g., PCA or t-SNE). For example code, see pcawrongplot.
Parameters:
- celldata (anndata.AnnData) – data to plot
- coords (numpy.ndarray) – (N, 2) or (N, 3) shaped array with coordinates to plot
- yhat (numpy.ndarray) – (N, 1) shaped array of predicted y values
- labels (list, optional) – list of axis titles
- **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.pcafigure(celldata, **scatterkwargs)¶
Make a 3D PCA figure for an AnnData object
Parameters:
- celldata (anndata.AnnData) – data to plot
- **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.pcawrongplot(celldata, yhat, **scatterkwargs)¶
Generate a 3D PCA figure with incorrectly classified points highlighted
Parameters:
- celldata (anndata.AnnData) – data to plot
- yhat (numpy.ndarray) – (N, 1) shaped array of predicted y values
- **scatterkwargs – keyword arguments to pass to Scatter or Scatter3D in plotly (dictionaries are merged recursively)
picturedrocks.plot.plotgeneheat(celldata, coords, genes, hide_clusts=False, **scatterkwargs)¶
Generate gene heat plot for some embedding of AnnData
This generates a figure with multiple dropdown options. The first option is “Clust” for a plot similar to genericplot, and the remaining dropdown options correspond to genes specified in genes. When celldata.genes is defined, these dropdowns are labeled with the gene names.
Parameters:
- celldata (anndata.AnnData) – data to plot
- coords (numpy.ndarray) – (N, 2) or (N, 3) shaped coordinates of the embedded data (e.g., PCA or tSNE)
- genes (list) – list of gene indices or gene names
- hide_clusts (bool) – Determines if cluster labels are ignored even if they are available
Selecting Markers¶
PicturedRocks currently implements two categories of marker selection algorithms:
- mutual information-based algorithms
- 1-bit compressed sensing based algorithms
Mutual information¶
TODO: Explanation of how these work goes here.
Before running any mutual information based algorithms, we need a discretized version of the gene expression matrix, with a limited number of discrete values (because we do not make any assumptions about the distribution of gene expression). Such data is stored in a picturedrocks.markers.InformationSet object. We suggest using picturedrocks.markers.makeinfoset() to generate such an object after appropriate normalization.
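To show why discrete values are needed, here is a minimal sketch of a plug-in mutual information estimate computed from counts of discrete values. The functions entropy and mutual_information and the toy data are hypothetical and are not part of the picturedrocks API; they only illustrate the kind of quantity the algorithms in this section work with.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y) for two equally long discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


labels = [0, 0, 1, 1]
informative_gene = [0, 0, 3, 3]     # determines the label exactly
uninformative_gene = [1, 2, 1, 2]   # independent of the label
print(mutual_information(informative_gene, labels))    # 1.0
print(mutual_information(uninformative_gene, labels))  # 0.0
```

With continuous expression values almost every observation would be unique, so the counts above would carry no information; discretizing into a few levels avoids this.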
class picturedrocks.markers.mutualinformation.iterative.MIM(infoset)¶
class picturedrocks.markers.mutualinformation.iterative.CIFE(infoset)¶
class picturedrocks.markers.mutualinformation.iterative.JMI(infoset)¶
class picturedrocks.markers.mutualinformation.iterative.UniEntropy(infoset)¶
class picturedrocks.markers.mutualinformation.iterative.CIFEUnsup(infoset)¶
Auxiliary Classes and Methods¶
class picturedrocks.markers.InformationSet(X, has_y=False)¶
Stores discrete gene expression matrix
Parameters:
- X (numpy.ndarray) – a (num_obs, num_vars) shape array with dtype int
- has_y (bool) – whether the array X has a target label column (a y column) as its last column
picturedrocks.markers.makeinfoset(adata, include_y)¶
Discretize data
Parameters:
- adata (anndata.AnnData) – The data to discretize. By default data is discretized as round(log2(X + 1)).
- include_y (bool) – Determines if the y (cluster label) column is included in the InformationSet object
Returns: An object that can be used to perform information theoretic calculations.
Return type: picturedrocks.markers.mutualinformation.infoset.InformationSet
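The default discretization round(log2(X + 1)) mentioned above can be sketched directly in NumPy. The function discretize and the toy counts are hypothetical, and appending the labels as the last column mirrors the has_y description of InformationSet; whether makeinfoset builds its array exactly this way is an assumption.

```python
import numpy as np


def discretize(X):
    """Discretize non-negative expression values as round(log2(X + 1)).

    Hypothetical helper for illustration; not part of picturedrocks.
    """
    return np.round(np.log2(np.asarray(X, dtype=float) + 1)).astype(int)


counts = np.array([[0, 1, 3], [7, 15, 100]])
discrete = discretize(counts)
print(discrete)
# [[0 1 2]
#  [3 4 7]]

# Assumed layout when labels are included: the y column goes last.
labels = np.array([0, 1])
with_y = np.column_stack([discrete, labels])
print(with_y.shape)  # (2, 4)
```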
Interactive Marker Selection¶
class picturedrocks.markers.interactive.InteractiveMarkerSelection(adata, feature_selection, disp_genes=10, connected=True, show_cells=True, show_genes=True)¶
Run an interactive marker selection GUI inside a jupyter notebook
Parameters:
- adata (anndata.AnnData) – The data to run marker selection on. If you want to restrict to a small number of genes, slice your anndata object.
- feature_selection (picturedrocks.markers.mutualinformation.iterative.IterativeFeatureSelection) – An instance of an iterative feature selection algorithm class that corresponds to adata (i.e., the column indices in feature_selection should correspond to the column indices in adata)
- disp_genes (int) – Number of genes to display as options. By default, the number of genes plotted on the tSNE plot is 3 * disp_genes, but this can be changed by setting the plot_genes property after initializing.
- connected (bool) – Parameter to pass to plotly.offline.init_notebook_mode. If your browser does not have internet access, you should set this to False.
- show_cells (bool) – Determines whether to display a tSNE plot of the cells with a drop-down menu to look at gene expression levels for candidate genes.
- show_genes (bool) – Determines whether to display a tSNE plot of genes to visualize gene similarity
Warning
This class requires modules not explicitly listed as dependencies of picturedrocks. Specifically, please ensure that you have ipywidgets installed and that you use this class only inside a jupyter notebook.
picturedrocks.markers.interactive.cife_obj(H, i, S)¶
The CIFE objective function for feature selection
Parameters:
Returns: the candidate feature’s score relative to the selected gene set S
Return type:
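For orientation, the CIFE criterion as commonly stated in the feature selection literature scores a candidate feature i by its relevance I(x_i; y) minus, for each already selected feature j, the redundancy I(x_i; x_j) less the conditional redundancy I(x_i; x_j | y). The sketch below writes this in terms of joint entropies and runs a greedy selection on toy data. All names here (make_entropy, cife_score, greedy_select) are hypothetical, the signature differs from cife_obj above (the label column is passed explicitly), and this is not the picturedrocks implementation.

```python
from collections import Counter
from math import log2


def make_entropy(rows):
    """Return H(cols): joint entropy (bits) of the given column indices of `rows`."""
    n = len(rows)

    def H(cols):
        counts = Counter(tuple(row[c] for c in cols) for row in rows)
        return -sum((k / n) * log2(k / n) for k in counts.values())

    return H


def cife_score(H, i, S, y):
    """CIFE score of candidate column i given selected columns S and label column y."""
    score = H([i]) + H([y]) - H([i, y])                            # I(x_i; y)
    for j in S:
        redundancy = H([i]) + H([j]) - H([i, j])                   # I(x_i; x_j)
        conditional = H([i, y]) + H([j, y]) - H([i, j, y]) - H([y])  # I(x_i; x_j | y)
        score -= redundancy - conditional
    return score


def greedy_select(H, candidates, y, n_select):
    """Repeatedly add the remaining candidate with the highest CIFE score."""
    selected = []
    for _ in range(n_select):
        remaining = [c for c in candidates if c not in selected]
        selected.append(max(remaining, key=lambda c: cife_score(H, c, selected, y)))
    return selected


# Columns 0-2 are discretized genes, column 3 is the cluster label.
# Gene 1 duplicates gene 0; gene 2 carries complementary information.
rows = [(0, 0, 0, 0), (0, 0, 1, 1), (1, 1, 0, 2), (1, 1, 1, 3)]
H = make_entropy(rows)
selected = greedy_select(H, candidates=[0, 1, 2], y=3, n_select=2)
print(selected)  # [0, 2]
```

In this toy data all three genes are equally relevant on their own, so a purely relevance-based ranking could pick the duplicate pair; the redundancy term is what steers the second choice to gene 2.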
Measuring Feature Selection Performance¶
This module can be used to evaluate feature selection methods via K-fold cross validation.
class picturedrocks.performance.FoldTester(adata)¶
Performs K-fold Cross Validation for Marker Selection
FoldTester can be used to evaluate various marker selection algorithms. It can split the data into K folds, run marker selection algorithms on these folds, and classify data based on testing and training data.
Parameters: adata (anndata.AnnData) – data to slice into folds
- classify(classifier)¶
Classify each cell using training data from other folds
For each fold, we project the data onto the markers selected for that fold, which we treat as test data. We also project the complement of the fold and treat that as training data.
Parameters: classifier – a classifier that trains with a training data set and predicts labels of test data. See NearestCentroidClassifier for an example.
- loadfolds(file)¶
Load folds from a file
The file can be one saved either by FoldTester.savefolds() or FoldTester.savefoldsandmarkers(). In the latter case, it will not load any markers.
See also
- loadfoldsandmarkers(file)¶
Load folds and markers
Loads a folds and markers file saved by FoldTester.savefoldsandmarkers()
Parameters: file (str) – filename to load from (typically with a .npz extension)
See also
- makefolds(k=5, random=False)¶
Makes folds
Parameters:
- savefolds(file)¶
Save folds to a file
Parameters: file (str) – filename to save (typically with a .npz extension)
- savefoldsandmarkers(file)¶
Save folds and markers for each fold
This saves folds, and for each fold, the markers previously found by FoldTester.selectmarkers().
Parameters: file (str) – filename to save to (typically with a .npz extension)
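The fold logic described for FoldTester can be sketched with the standard library alone. The function make_folds and its shuffle and seed arguments are hypothetical (the meaning of the random argument of makefolds is not documented above), so treat this only as an illustration of splitting observations into K folds and using each fold's complement as training data.

```python
import random


def make_folds(num_obs, k=5, shuffle=False, seed=0):
    """Split range(num_obs) into k folds of nearly equal size.

    Hypothetical helper for illustration; not part of picturedrocks.
    """
    indices = list(range(num_obs))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at position i.
    return [indices[i::k] for i in range(k)]


num_obs = 12
folds = make_folds(num_obs, k=5)
for test_fold in folds:
    held_out = set(test_fold)
    # The complement of the fold serves as training data for that fold.
    train = [i for i in range(num_obs) if i not in held_out]
    print(len(train), len(test_fold))
```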
class picturedrocks.performance.NearestCentroidClassifier¶
Nearest Centroid Classifier for Cross Validation
Computes the centroid of each cluster label in the training data, then predicts the label of each test data point by finding the nearest centroid.
- test(Xtest)¶
- train(adata)¶
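The nearest centroid idea described above can be sketched in a few lines of NumPy. The class NearestCentroidSketch is hypothetical: unlike the train method listed above, it takes a plain array and a label vector rather than an AnnData object, and it is not the picturedrocks implementation.

```python
import numpy as np


class NearestCentroidSketch:
    """Toy nearest centroid classifier (hypothetical; not the picturedrocks class)."""

    def train(self, X, y):
        # One centroid per distinct label: the mean of that label's rows.
        self.labels_ = np.unique(y)
        self.centroids_ = np.stack(
            [X[y == label].mean(axis=0) for label in self.labels_]
        )

    def test(self, Xtest):
        # Squared Euclidean distance from every test point to every centroid.
        diffs = Xtest[:, None, :] - self.centroids_[None, :, :]
        dists = (diffs ** 2).sum(axis=2)
        return self.labels_[dists.argmin(axis=1)]


X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
clf = NearestCentroidSketch()
clf.train(X, y)
predicted = clf.test(np.array([[0.2, 0.4], [5.5, 4.0]]))
print(predicted)  # [0 1]
```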
class picturedrocks.performance.PerformanceReport(y, yhat)¶
Report actual vs predicted statistics
Parameters:
- y (numpy.ndarray) – actual cluster labels, (N, 1)-shaped numpy array
- yhat (numpy.ndarray) – predicted cluster labels, (N, 1)-shaped numpy array
- confusionmatrixfigure()¶
Compute and make a confusion matrix figure
Returns: confusion matrix
Return type: plotly figure
- getconfusionmatrix()¶
Get the confusion matrix for the latest run
Returns: array of shape (K, K), with the [i, j] entry being the fraction of cells in cluster i that were predicted to be in cluster j
Return type: numpy.ndarray
- printscore()¶
Print a message with the score
- show()¶
Print a full report
This uses iplot, so we assume this will only be run in a Jupyter notebook and that init_notebook_mode has already been run.
- wrong()¶
Returns the number of cells misclassified.
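The two numeric quantities reported above, the row-normalized confusion matrix and the number of misclassified cells, can be computed as in the following sketch. The function confusion_fractions and the toy labels are hypothetical and only follow the description of getconfusionmatrix and wrong; the actual PerformanceReport code may differ.

```python
import numpy as np


def confusion_fractions(y, yhat, num_clusters):
    """Entry [i, j] is the fraction of cells in cluster i predicted as cluster j.

    Hypothetical helper for illustration; not part of picturedrocks.
    """
    counts = np.zeros((num_clusters, num_clusters))
    for actual, predicted in zip(np.ravel(y), np.ravel(yhat)):
        counts[actual, predicted] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Avoid dividing by zero for clusters that contain no cells.
    return counts / np.where(row_totals == 0, 1, row_totals)


# (N, 1)-shaped actual and predicted labels, as in the parameter description.
y = np.array([[0], [0], [1], [1], [1], [2]])
yhat = np.array([[0], [1], [1], [1], [0], [2]])
matrix = confusion_fractions(y, yhat, num_clusters=3)
num_wrong = int((np.ravel(y) != np.ravel(yhat)).sum())
print(matrix)
print(num_wrong)  # 2
```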