cytominer-database¶
cytominer-database provides command-line tools for organizing measurements extracted from images.
Software tools such as CellProfiler can extract hundreds of measurements from millions of cells in a typical high-throughput imaging experiment. The measurements are stored across thousands of CSV files.
cytominer-database helps you organize these data into a single database backend, such as SQLite.
Why cytominer-database?¶
While tools like CellProfiler can store measurements directly in databases, it is usually infeasible to create a centralized database in which to store these measurements. A more scalable approach is to create a set of CSVs per “batch” of images, and then later merge these CSVs.
cytominer-database ingest reads these CSVs, checks for errors, then ingests them into a database backend, including SQLite, MySQL, PostgresSQL, and several other backends supported by odo.
cytominer-database ingest source_directory sqlite:///backend.sqlite -c ingest_config.ini
will ingest the CSV files nested under source_directory into a SQLite backend
Configuration¶
[filenames]
image = image.csv
object = object.csv
experiment = Experiment.csv
Reference¶
ingest¶
A mechanism to ingest CSV files into a database.
In morphological profiling experiments, a CellProfiler pipeline is often run in parallel across multiple images and produces a set of CSV files. For example, imaging a 384-well plate, with 9 sites per well, produces 384 * 9 images; a CellProfiler process may be run on each image, resulting in a 384*9 output directories (each directory typically contains one CSV file per compartment (e.g. Cells.csv, Cytoplasm.csv, Nuclei.csv) and one CSV file for per-image measurements (e.g. Image.csv).
cytominer_database.ingest.seed
can be used to read all these CSV files into a database backend. SQLite is the
recommended engine, but ingest will likely also work with PostgreSQL and MySQL.
cytominer_database.ingest.seed
assumes a directory structure like shown below:
Example:
import cytominer_database.ingest
cytominer_database.ingest.seed(source, target, config)
-
cytominer_database.ingest.
checksum
(pathname, buffer_size=65536)[source]¶ Generate a 32-bit unique identifier for a file.
Parameters: - pathname – input file
- buffer_size – buffer size
-
cytominer_database.ingest.
into
(input, output, name, identifier, skip_table_prefix=False)[source]¶ Ingest a CSV file into a table in a database.
Parameters: - input – Input CSV file.
- output – Connection string for the database.
- name – Table in database into which the CSV file will be ingested
- identifier – Unique identifier for
input
. - skip_table_prefix – True if the prefix of the table name should be excluded from the names of columns.
-
cytominer_database.ingest.
seed
(source, target, config_file, skip_image_prefix=True)[source]¶ Read CSV files into a database backend.
Parameters: - config_file – Configuration file.
- source – Directory containing subdirectories that contain CSV files.
- target – Connection string for the database.
- skip_image_prefix – True if the prefix of image table name should be excluded from the names of columns from per image table
munge¶
-
cytominer_database.munge.
munge
(config_file, source, target=None)[source]¶ Searches
source
for directories containing a CSV file corresponding to per-object measurements, then splits the CSV file into one CSV file per compartment.For instance, the CSV file may comprise of measurements combined across Cells, Cytoplasm, and Nuclei.
munge
will split this CSV file into 3 CSV files: Cells.csv, Cytoplasm.csv, and Nuclei.csv.Parameters: - config_file – Configuration file.
- source – Directory containing subdirectories that contain an object CSV file.
- target – Output directory. If not specified, then it is same as
source
.
Returns: list of subdirectories that have an object CSV file.
Example:
import cytominer_database.munge cytominer_database.munge.munge(source, target, config)
utils¶
-
cytominer_database.utils.
collect_csvs
(config, directory)[source]¶ Collect CSV files from a directory.
This function collects CSV files in a directory, excluding those that have been specified in the configuration file. This enables collecting only those CSV files that correspond to cellular compartments. e.g. Cells.csv, Cytoplasm.csv, Nuclei.csv. CSV files corresponding to experiment, image, or object will be excluded.
Parameters: - config – configuration file.
- directory – directory containing the CSV files.
Returns: a list of CSV files.
-
cytominer_database.utils.
find_directories
(directory)[source]¶ List subdirectories.
Parameters: directory – directory Returns: list of subdirectories of directory
-
cytominer_database.utils.
read_config
(filename)[source]¶ Read a configuration file. A default config file is read first, and the values are overriden by those in the specified configuration file.
Parameters: filename – configuration filename Returns: a configuration object
-
cytominer_database.utils.
validate_csv
(csvfile)[source]¶ Validate a CSV file.
The CSV file typically corresponds to either a measurement made on a compartment, e.g. Cells.csv, or on an image, e.g. Image.csv. The validation performed is generic - it simply checks for malformed CSV files.
This uses csvclean to check for validity of a CSV file.
Parameters: csvfile – CSV file to validate Returns: True if valid, False otherwise.
-
cytominer_database.utils.
validate_csv_set
(config, directory)[source]¶ Validate a set of CSV files.
This function validates a set of CSV files in a directory. These CSV files correspond to measurements made on different cellular compartments, e.g. Cells.csv, Cytoplasm.csv, Nuclei.csv. An Image.csv file, corresponding to measurements made on the whole image, along with metadata, is also typically present.
Parameters: - config – configuration file - this contains the set of CSV files to validate.
- directory – directory containing the CSV files.
Returns: a tuple where the first element is the list of compartment CSV files, the second is the image CSV file.