Welcome to COMMAND>_’s documentation!

COMMAND>_ is a web-based application used to download, collect and manage gene expression data from public databases.

Main features are:

  • Easy installation and update using Docker Compose technology.
  • Graphical User Interface (GUI) for parsing and importing gene expression data.
  • Default Python scripts for easy parsing/importing of the most common microarray platforms (Affymetrix, Nimblegen, two-color, etc.) and a dedicated scripting editor that allows flexible importing of any kind of gene expression data.
  • Automatic pre-processing (downloading, trimming, mapping and counting) of bulk RNA-Seq data.
  • Exporting of the collected data.

Note

Give it a try on https://command.fmach.it:4242 using:

  • username: guest
  • password: demo

Check out the Use Cases!

What is COMMAND>_?

COMMAND>_ [1] is an acronym that stands for COMpendia MANagement Desktop. It is the software used for the creation of several gene expression compendia such as COLOMBOS [2] and VESPUCCI [3]. Although it has been in use since 2010, it was made publicly available only in 2018, after being completely rewritten. COMMAND>_ was originally conceived for the collection (and integration) of prokaryotic microarray experiments. Over time it has evolved to allow RNA-Seq experiments to be imported and other species to be managed. With the current implementation COMMAND>_ is still meant for gene expression data collection, but it can easily be extended to support other kinds of quantitative measurement technologies (have a look at COMMAND>_ for developers).

COMMAND>_ is a Python web application developed using the Django framework for the backend, while the web interface has been developed using ExtJS with a look and feel typical of desktop applications. With COMMAND>_ you can search and download experiments from public gene expression databases such as GEO, ArrayExpress or SRA, parse the downloaded files to extract only valuable information, preview the parsed data and import experiment data into a database. The pivotal point is the usage of custom Python scripts to mine only the relevant information. Scripts can be created or modified directly within the interface and are responsible for parsing input files and populating each part of the data model (see Database schema), i.e. measurement data and metadata for experiments, platforms and samples.

For microarray platforms probes need to be mapped to genes, but before this step genes first have to be imported. COMMAND>_ allows you to perform both steps. For the latter it is simply a matter of uploading a FASTA file with gene sequences (see data_collection), while for the former a BLAST alignment followed by a two-step filtering is performed. In this way the microarray gets annotated with the latest available information, enhancing homogeneity since all microarrays are annotated using the same gene list (see also map_feature).

References

[1]Moretto, M. et al. (2019). First step toward gene expression data integration: transcriptomic data acquisition with COMMAND>_. BMC bioinformatics, 20(1), 54.
[2]Moretto, M. et al. (2015). COLOMBOS v3. 0: leveraging gene expression compendia for cross-species analyses. Nucleic acids research, 44(D1), D620-D623.
[3]Moretto, M. et al. (2016). VESPUCCI: exploring patterns of gene expression in grapevine. Frontiers in plant science, 7, 633.

Getting started with COMMAND>_

Getting my user id and password

If you are using the public COMMAND>_ instance on https://command.fmach.it:4242 you can login using:

  • username: guest
  • password: demo

This is a user with restricted privileges meant for demonstration purposes only. If you have your own running instance of COMMAND>_ (see Deploy) you will be able to log in for the first time using:

  • username: admin
  • password: admin

Now you can change the admin password, create new users and assign them privileges following the instructions in Admin.

_images/login.png

Set up and select a compendium

The first thing to do is to create a new empty compendium. Go to Admin (top bar) > Compendium Manager > Create Compendium (bottom-left corner + icon) and follow the instructions at Admin.

_images/comp01.png

Now that a new compendium has been set up you need to retrieve a FASTA file containing the gene ids and sequences for the species you want to study.

Tip

For example you can visit the NCBI Nucleotide database and get the coding sequences for the organism of interest. This file is mandatory, since it is used for BLAST alignment of probes in microarray experiments and for read mapping in RNA-Seq experiments. In order to import it into COMMAND>_ go to Data collection (on the top-left corner) > Biological features, then select Import biological features from the bottom-left + icon.

_images/comp02.png
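
As an optional local sanity check (independent of COMMAND>_ and assuming Biopython is installed; the file name is just a placeholder), you can quickly inspect the FASTA file before importing it:

# Optional check of the downloaded gene annotation FASTA file.
# This is not part of COMMAND>_; "genes.fasta" is a placeholder name.
from Bio import SeqIO

records = list(SeqIO.parse("genes.fasta", "fasta"))
print(len(records), "gene sequences found")
for record in records[:3]:
    # the record id is what will be used as the gene name on import
    print(record.id, len(record.seq))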

Now your gene annotation file has been imported and you can start looking for interesting experiments (both microarray and RNA-Seq) related to the organism of interest.

Searching public databases

After a new empty compendium has been created and a species of interest selected, the user can start looking for collections of samples (from one or more experiments) in public databases (GEO, ArrayExpress or, in case of RNA-Seq experiments, SRA): go to Data collection (on the top-left corner) > Experiments > New Experiment (on the bottom-right corner) > From public DB.

_images/search01.png
  • In the Search options field of the ‘Download from Public DB’ dialog select the DB (here GEO) and the term of interest, either a description (e.g. Leukemia b-cell, Vitis vinifera, etc.) or directly a GSE ID.
  • From the list select an experiment of interest and click the download button.

Tip

You can download multiple experiments at the same time.

_images/search02.png
  • After a while, depending on the number of samples in the selected experiment(s), your experiment will be downloaded.

Tip

  • Check Message log frequently.
  • Inspect the Experiments section to see which experiments are available, yet to be parsed or already imported.

Now you can start parsing and importing experiments (see Use Cases).

Message log

The Message Log page (Top > Options > Message Log) allows the user to keep an eye on every activity of COMMAND>_. Check it frequently!


Admin

The admin interface is visible only to admin users, who have complete access to COMMAND>_ functionalities and compendia.

Users/Group manager


User manager page

The user menu allows an admin to create, remove and modify users. Moreover, an admin user can assign users to groups and set group privileges. Group privileges are compendium-specific, i.e. we can for example restrict access to some compendia and prevent users belonging to a group from seeing the others. For those compendia we can also limit some functionalities, for example we could prevent users from running Python scripts or importing experiments.


Permission manager page

Compendium manager


Compendium manager page

From this page a compendium can be created, modified, deleted and initialized. From a technical point of view a compendium is nothing more than a database schema. When an admin user creates a new compendium they will be asked to provide all the information necessary to connect to that database.


New compendium page

Once the connection information is saved and the new compendium appears in the grid, it is possible to initialize it, i.e. to create the database schema.

Note

In this way it is possible to have compendia hosted on different database servers. If the database does not exist yet, COMMAND>_ can create it on the fly, but you will be asked to provide the username and password of a database admin user. The default database admin user is postgres with password postgres.

The Compendium Type section is read-only and at the moment contains only gene expression, since it is the only type of compendium you are allowed to create. To extend COMMAND>_ and allow other kinds of quantitative data to be collected please have a look at COMMAND>_ for developers.

Deploy

COMMAND>_ is a complex application and relies on several other software components to work. To ease the deployment process a docker-compose.yml file is provided, so assuming you have a working Docker Compose environment, the deployment process will be a matter of running a few commands.

In case you want to manually deploy COMMAND>_ in your environment there are more steps you will need to take care of, such as installing the web server, the DBMS, etc.

Requirements

Have a look at the requirements.txt file for details. COMMAND>_'s main dependencies are:

COMMAND>_ also uses several external tools that you will need to download separately:

Docker Compose

Assuming that you have Docker Compose correctly installed, you should be able to perform the following steps:

# 1. clone the repository
git clone https://github.com/marcomoretto/command.git

# 2. copy external dependencies (check figure below)

# 3. build
docker-compose build

# 4. start docker
docker-compose up -d

# 5. create database schema
docker-compose exec web python manage.py migrate

# 6. create admin user
docker-compose exec web python manage.py init_admin

# 7. create initial options
docker-compose exec web python manage.py init_options

# 8. create demo compendium
docker-compose exec web python manage.py init_demo_compendium demo

# 9. run daphne
docker-compose exec -d daphne daphne -b 0.0.0.0 -p 8001 cport.asgi:channel_layer

# 10. run worker
docker-compose exec -d worker python3 manage.py runworker

That’s it! You should be able to point your browser to http://localhost and log in to COMMAND>_ using:
  • username: admin
  • password: admin

Note

You should have the following directory structure for the external tools

_images/external_dir_structure.png

Note

You might need to rename the directory from command to cport before doing step # 2.

Manual Deploy

One easy way to understand what you need to do to manually deploy COMMAND>_ is to have a look at two files:

In a nutshell, after having installed and configured Nginx (or another web-server to run Django applications), PostgreSQL, Redis, RabbitMQ and Celery, you’ll have to run:

pip3 install --upgrade pip
pip3 install Cython==0.28.1
pip3 install -r requirements.txt

Now you should be ready to configure Django (check the documentation for details), create the database schema and run the application.

python manage.py migrate

python manage.py init_admin

python manage.py init_options

python manage.py init_demo_compendium demo

daphne -b 0.0.0.0 -p 8001 cport.asgi:channel_layer

python3 manage.py runworker

Note

COMMAND>_ is a Django application, so refer to the Django documentation for database configuration: https://docs.djangoproject.com/en/1.11/ref/settings/
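
As a minimal sketch (host, database name, user and password below are placeholders to adapt to your setup), a PostgreSQL configuration in settings.py could look like this:

# settings.py - minimal PostgreSQL configuration sketch
# (all values are placeholders; adapt them to your environment)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'command',
        'USER': 'postgres',
        'PASSWORD': 'postgres',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}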

Database schema

_images/Database_schema.png

Use Cases

In this section we show how to both parse and import experiments from various gene expression platforms, technologies and sources (both public databases and local files) using the provided default scripts.

Use Case - Affymetrix from GEO

Import Gene Annotations

We want to look for experiments related to Yeast: the Saccharomyces Genome Database is the proper choice for retrieving the sequences associated with Yeast genes (from this link). Go to Data collection (on the top-left corner) > Biological features > Import biological features (+ symbol on the bottom left) > Type: FASTA, File name: select the annotation file you downloaded before > Import Biological features. Wait.

_images/Case01_00.png
_images/Case01_01.png

We start by selecting Experiments from Data collection (top-left corner), then we highlight the experiment of interest (previously retrieved from GEO following Searching public databases), here GSE8536, an expression analysis study that inspects the response of Saccharomyces cerevisiae to stress throughout a 15-day wine fermentation.

Parse Experiment, Platform and Samples

Since we have a new platform (GPL90), never imported before into COMMAND>_ for this compendium, we retrieve the sequences associated with the Affymetrix probe ids (YG_S98 probes) for this platform from the Affymetrix Support website.

From Experiments (Data collection menu) we highlight the selected experiment (GSE8536 here) and click Parse/Import experiment on the bottom bar. On the main window you can see that the Experiment tab is populated with metadata gathered from the public DB (GEO here). Now we can start parsing the Experiment, the Platform(s) and the Samples.

_images/Case01_02.png
_images/Case01_03.png

Since the dataset was retrieved from GEO, we take advantage of the downloaded .soft file (see the GEO documentation for a description of this type of file):

Select GSE8536_family.soft and click the Use assignment script to assign files to experiment entities icon on the bottom-right. A dialog will show up:

  • Script > assign_all.py > Only selected files
  • Experiment tab > Script: > soft_experiment.py, Execution order: 1
  • Platform tab > Script: > soft_platform.py, Execution order: 1
  • Sample tab > Script: > soft_sample.py, Execution order: 1
  • Run assignment script
_images/Case01_03b.png

Now, in order to parse the new platform, we are going to use the sequences associated with the Affymetrix probe ids that we have already downloaded. We import the annotation (YG_S98 probes) in the File assignment section of Experiment files by clicking the upload icon at the bottom of the page.

Now we associate the file to the platform:

  • In Experiment files Section > File Assignment select the uploaded file (YG_S98.probe_tab) and click Use assignment script to assign files to experiment entities. On the Assign files dialog:
  • Script: assign_all.py
  • Param:
  • Only selected files checked (default)
  • Platform tab > Script: gpr_platform.py, Parameters: 0,Probe X|Probe Y,Probe Sequence, Execution order: 2
  • Run assignment script
_images/Case01_03c.png

Note

  • The Parameters assigned to the gpr_platform.py script specify not to skip any line, to use the combination of the Probe X and Probe Y columns to create a unique id matching the CEL files, and indicate that the probe sequences are in the Probe Sequence column.
  • The parsing of the Platform is a one-time procedure: from now on we can use this platform for all related experiments.

Now we parse the Affymetrix CEL files (sample files):

  • In Experiment files Section > File Assignment we use CEL as filter and select all files > click the Use assignment script to assign files to experiment entities icon on the bottom-right corner; the Assign files and scripts to experiment structure dialog will pop up:
  • Script > match_entity_name.py
  • Only selected files (default) checked
  • Sample tab > Script: cel_sample.py, Execution order: 2
  • Run assignment script
_images/Case01_03d.png

Finally, in the Preview Section (Preview of GSE8536 here) click Run Selected (bottom-right corner). After a while your samples will be parsed.

Now you can Import both the Platform (since this is the first time we use this specific one) and the Experiment.

Tip

Check that both the platform and the samples are properly parsed from the Preview interface of the Parse Experiment section by clicking on the platform and on each sample.

Click the Import button on the bottom-right corner and select Import whole experiment. After a while the experiment and the platform (in this case) will be imported.

_images/Case01_04.png

Use Case - Nimblegen from ArrayExpress

In COMMAND>_ the preferred way to import experiments from a public database is through GEO, which provides the most convenient interface out-of-the-box. In case an experiment is not included in GEO it is possible to import it from ArrayExpress. Start by searching for the experiment of interest following the procedure described in Searching public databases, selecting E-GEOD-58806 as Term and ArrayExpress as Database. Go to the experiment list on the left, select the experiment of interest (here E-GEOD-58806) and click >_ Parse/Import experiment. On the main window you can see that the Experiment tab is populated with metadata gathered from the public DB (ArrayExpress here).

Import Platform from GEO

COMMAND>_ can use a previously imported platform from a different public database (either ArrayExpress or GEO) and assign it as the Reporter platform (in the Preview main section of Parsing) for the current experiment.

In our case we want to parse and import an experiment from ArrayExpress using a previously imported platform from GEO. In order to do so we import ONLY the platform of another experiment (here GSE32561), which uses the same platform as the experiment of interest. After selecting the new experiment using the Searching public databases procedure, we use the Nimblegen .ndf file, which allows probes to be associated with sequences, for the platform GPL14649.

Experiment files > File Assignment > Select GPL14649_071112_Ecoli_K12_EXP.ndf and in the Assign files dialog:

  • Script: match_entity_name.py
  • Param: platform
  • Only selected files checked (default)
  • Platform tab > Script: > gpr_platform.py; Parameters: 0,X|Y,PROBE_ID; Execution order: 2
  • Run assignment script
_images/Case02_01.png

Now we can import this platform from the Platform section of Preview:

_images/Case02_02.png

Parse Experiment, Platform and Samples

Now the Platform is available and can be used to import the experiment retrieved from ArrayExpress. Go to Experiments > Parse Experiment E-GEOD-58806 > Experiment Files > Platform, then click on A-GEOD-14649 in the Reporter Platform field and select the previously imported GPL14649.

_images/Case02_03.png

Finally you parse and import the Nimblegen .pair files:

  • In Experiment files Section > File Assignment > Filter .pair and select all files
  • click the Use assignment script to assign files to experiment entities icon on the bottom-right and the Assign files and scripts to experiment structure dialog will pop-up:
  • Script: match_sample_name.py > Only selected files
  • Sample tab > Script: pair_sample.py, Execution order: 2
  • Run assignment script

Use Case - Multiplatform Experiment

It is standard practice for gene expression experiments to make use of multiple platforms for the same organism in the same experiment: usually this comes from multiple single experiments performed in different conditions or at different times. Here, we select from GEO the GSE13713 experiment regarding Phenotypic and transcriptomic analyses of mildly and severely salt-stressed Bacillus cereus ATCC. It is related to two platforms: GPL7634 and GPL7636.

Import Gene Annotation

Since the platforms related to the selected experiment were never imported before into COMMAND>_, we need the gene sequences in order to properly import our probes at gene level. We get the gene/sequence list from NCBI: go here and from the top-right Send to button select Coding sequences, Format: FASTA Nucleotide and Choose destination: File. In COMMAND>_ go to Data Collection (on the top-left corner) > Bio features (genes) > Import biological features (+ symbol on the bottom left) > Type: FASTA, File name: select the annotation file you downloaded before > Import Biological features.

_images/Case03_01.png

Parse Platforms and Samples

In order to parse the two platforms, we need both the soft file related to the experiment and the soft_platform.py script.

In Experiment files Section > File Assignment > Select the GSE13713_family.soft file and on the Assign files dialog:

  • Script: match_all.py
  • Param: platform
  • Only selected files checked
  • Platform tab > Script: > soft_platform.py, parameters: True, Execution order: 1

In Experiment files Section > File Assignment > Select the .txt files (all Sultana in the Filter field) and on the Assign files dialog:

  • Script: match_entity_name.py
  • Parameters: ch1
  • Only selected files checked

Sample tab

  • Script: gpr_sample.py
  • Parameters: Gene name,Spot Mean Intensity (Cyanine5_060909_1136(1)),0
  • Execution order: 2

Do the same again for ch2, but use as Parameters in the Sample tab:

  • Script: gpr_sample.py
  • Parameters: Gene name,Spot Mean Intensity (Cyanine3_060909_1136(1)),0
  • Execution order: 2

For Platform GPL10439:

  • In Experiment files Section > File Assignment > Select the .ndf file and on the Assign files dialog:

    • Script: match_entity_type_param.py
    • Param: platform
    • Only selected files checked
    • Platform tab > Script: soft_platform.py, Execution order: 2

  • In Experiment files Section > File Assignment > Select the .txt files (all pair files) and on the Assign files dialog:

    • Script: match_entity_name.py
    • Parameters: ch1
    • Only selected files checked
    • Sample tab > Script: gpr_sample.py, Parameters: ID_REF,Spot Mean Intensity (Alexa555_101810_0935(1)),0, Execution order: 2

  • Do the same again for ch2:

    • Parameters: ch2
    • Only selected files checked
    • Sample tab > Script: gpr_sample.py, Parameters: ID_REF,Spot Mean Intensity (Alexa647_111510_1227(1)), Execution order: 2

Use Case - Import experiment from local file

In order to import an experiment which is not available from public repositories the user needs to provide:

  • a yaml file (see an example: here) containing the description of the experiment to be imported: the first row contains the Experiment id, while the other rows start with a Platform id followed by its Sample ids.
  • a single compressed file (either zip or tar.gz) containing the raw data.

Go to Experiments > New Experiment (bottom-left) > From local file

_images/local_01.png

Fill in the form that pops up, starting with the Experiment ID (the same one contained in the yaml file, GSE13713 for the embedded example), then upload the yaml file (the system will check that the format is correct) and finally upload the compressed data. After a while your experiment will be imported.

_images/local_02.png
_images/local_03.png
_images/local_04.png

Use Case - RNA-Seq

Similarly to the microarray cases, RNA-Seq experiments can be retrieved from public databases, specifically the Sequence Read Archive (SRA), from the New Experiment > From public DB interface (bottom-left icon). Here we select a small RNA-Seq experiment from SRA (PRJNA471071) where the authors employed a computational model of underground metabolism and laboratory evolution experiments to examine the role of enzyme promiscuity in the acquisition and optimization of growth on predicted non-native substrates in E. coli K-12 MG1655.

_images/CaseRNASeq_01.png

Indexing

The first step is to build the index for the quasi-alignment mapper (kallisto here [1]): select demo.fasta; it contains the sequences of the Escherichia coli genes and is automatically built by COMMAND>_ when you begin parsing the data.

Use Assignment Script (bottom-right corner icon) > from the dialog: match_entity_name.py > Only selected files > Experiment tab > Script: kallisto_index.py, Execution order: 1 > Run assignment script

RNA-Seq pre-processing and summarization

Since the experiment is paired-end, the default script for preprocessing and summarization requires you to indicate only one of the two paired files. You can do this using the filter and selecting *1.fastq; the script will take care of the rest.

Use Assignment Script (bottom-right corner icon) > from the dialog: match_entity_name.py > Only selected files > Experiment tab > Script: trim_quantify.py, Execution order: 1, Parameters: 1 (since the experiment is paired-end)

_images/CaseRNASeq_05.png

Run assignment script

After a while all the samples will be preprocessed and summarized, and the experiment can be imported from the Preview section: bottom-right corner > Import whole experiment.

Mapping probes and export the gene expression matrix

If you are done with importing experiments you can now map the probes to genes using BLAST [2] and the two-step filtering GUI of COMMAND>_. Go to Platform, select the platform to be mapped (e.g. GPL90 from the Affymetrix Use Case) and click the chain icon (map platform to biological features) on the bottom-left corner.

Now you can use the dialog to run BLAST and filter the data (here we use the default settings).

When you are satisfied with the filtering you can use one of the selected filtered objects and download the expression matrix by going to Options > Export.

_images/MapProbes_01.png
_images/MapProbes_02.png

Tip

You can filter the data with different parameters; each set of parameters is saved in a specific slot.

References

[1]Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016), doi:10.1038/nbt.3519
[2]Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.

Python parsing scripts

The Experiment Object, Platform Object and Sample Object are Python objects used as proxies to import a new experiment into the database.

  • The file name of the experiment, platform or sample is stored in the variable named INPUT_FILE.
  • The name of the entity (experiment, platform or sample name) is stored in the variable named ENTITY_NAME.
  • To access the parameters passed to each script use the list PARAMETERS.
  • Within each entity (experiment, platform or sample) you can choose the execution order of the script using the Order column.

To access Experiment Object use the EXPERIMENT_OBJECT variable in the Python script used with experiment files.

EXPERIMENT_OBJECT variables

EXPERIMENT_OBJECT.experiment_access_id: (string) the experiment access id
EXPERIMENT_OBJECT.experiment_name: (string) the experiment name
EXPERIMENT_OBJECT.scientific_paper_ref: (string) the publication associated with the experiment
EXPERIMENT_OBJECT.description: (string) the experiment description
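
For illustration only, a minimal (hypothetical) experiment parsing script could fill these fields directly with placeholder values instead of extracting them from INPUT_FILE:

# Hypothetical minimal experiment script: fills EXPERIMENT_OBJECT with
# placeholder values instead of parsing them out of INPUT_FILE.
EXPERIMENT_OBJECT.experiment_access_id = ENTITY_NAME
EXPERIMENT_OBJECT.experiment_name = 'My test experiment'
EXPERIMENT_OBJECT.scientific_paper_ref = 'Doe et al. (2019)'
EXPERIMENT_OBJECT.description = 'Experiment imported from ' + INPUT_FILE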

To access Platform Object use the PLATFORM_OBJECT variable in the Python script used with platform files.

PLATFORM_OBJECT variables

PLATFORM_OBJECT.platform_access_id: (string) the platform access id
PLATFORM_OBJECT.platform_name: (string) the platform name
PLATFORM_OBJECT.platform_type: (string) ‘microarray’ or ‘rna-seq’
PLATFORM_OBJECT.description: (string) the platform description
PLATFORM_OBJECT.add_bio_feature_reporter_data(name, description, **kwargs): add a reporter to the platform

**kwargs are platform_type dependent, i.e. for ‘microarray’ they are probe_access_id, probe_set_name, probe_type and sequence.
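
As an illustration, a stripped-down (hypothetical) microarray platform script could add reporters from a tab-delimited input; the column layout and header line are assumptions made only for this sketch:

# Hypothetical sketch of a microarray platform script: one reporter per line
# of a tab-delimited file with probe id and sequence in the first two columns.
PLATFORM_OBJECT.platform_access_id = ENTITY_NAME
PLATFORM_OBJECT.platform_type = 'microarray'

with open(INPUT_FILE) as infile:
    next(infile)  # skip the header line
    for line in infile:
        probe_id, sequence = line.rstrip('\n').split('\t')[:2]
        PLATFORM_OBJECT.add_bio_feature_reporter_data(
            probe_id,                  # reporter name
            '',                        # reporter description
            probe_access_id=probe_id,  # 'microarray' specific **kwargs
            sequence=sequence
        )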

To access Sample Object use the SAMPLE_OBJECT variable in the Python script used with sample files.

SAMPLE_OBJECT variables and methods

SAMPLE_OBJECT.sample_name: (string) the sample name
SAMPLE_OBJECT.description: (string) the sample description
SAMPLE_OBJECT.add_raw_data(bio_feature_reporter_name, value): add raw data of this sample
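
Similarly, a minimal (hypothetical) sample script could read a tab-delimited raw data file, assuming for this sketch that the first two entries of PARAMETERS are the probe id column index and the value column index:

# Hypothetical sketch of a sample script: assumes PARAMETERS holds the probe id
# column index and the value column index of a tab-delimited raw data file.
probe_col = int(PARAMETERS[0])
value_col = int(PARAMETERS[1])

SAMPLE_OBJECT.sample_name = ENTITY_NAME

with open(INPUT_FILE) as infile:
    next(infile)  # skip the header line
    for line in infile:
        fields = line.rstrip('\n').split('\t')
        SAMPLE_OBJECT.add_raw_data(fields[probe_col], float(fields[value_col]))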

parsing_scripts package

Subpackages

parsing_scripts.experiment package
Submodules
parsing_scripts.experiment.kallisto_index module
parsing_scripts.experiment.kallisto_index.main()

Create an index file for the KALLISTO software using the current BIOLOGICAL FEATURES

Biological features of this compendium are put into a FASTA file that is then indexed to be used for RNA-seq quantification with KALLISTO

PARAMETERS:
None
parsing_scripts.experiment.soft_experiment module
parsing_scripts.experiment.soft_experiment.main()

Parse a SOFT file and extract EXPERIMENT information

Looks for accession number, experiment name, scientific paper, experiment description

PARAMETERS:
None
Module contents
parsing_scripts.file_assignment package
Submodules
parsing_scripts.file_assignment.assign_all module
parsing_scripts.file_assignment.assign_all.assign(input_files, entity, entity_type, parameters)

Assign the selected input files (or all the files if checked) to every selected ENTITY

For each ENTITY (experiment, platform or sample) for which a parsing script is selected, all the (selected) input files will be assigned regardless.

PARAMETERS:
None
parsing_scripts.file_assignment.match_entity_name module
parsing_scripts.file_assignment.match_entity_name.assign(input_files, entity, entity_type, parameters)

Assign the selected input files (or all the files if checked) to every ENTITY with matching NAME

For each ENTITY (experiment, platform or sample) for which a parsing script is selected, only the (selected) input files with a name that matches the one of the entity will be assigned (for example a file named GSE123.soft would match the experiment entity GSE123).

PARAMETERS:
None
Module contents
parsing_scripts.platform package
Submodules
parsing_scripts.platform.adf_platform module
parsing_scripts.platform.adf_platform.main()

Parse an ADF file and extract PLATFORM information

Looks for accession number, platform name, platform type and platform description

PARAMETERS:

param1 (string): The original probe id field. If it is composed of more than one field, put all of them separated by a |. For example X|Y

param2 (bool): If True (or 1 or a non-empty string) the probe information (sequence) will be added

parsing_scripts.platform.cdf_platform module
parsing_scripts.platform.cdf_platform.main()

Parse a CDF file (Affymetrix) and extract PLATFORM information

Looks for probe set name and probe id. Please note that CDF does not contain probe sequences; for that information refer to cdf_platform_fasta.py

PARAMETERS:
None
parsing_scripts.platform.cdf_platform_fasta module
parsing_scripts.platform.cdf_platform_fasta.main()

Parse a FASTA file containing probe sequences

This script is usually used before cdf_platform.py in order to get the probe sequence information that a CDF file doesn’t provide.

PARAMETERS:
None
parsing_scripts.platform.csv_platform module
parsing_scripts.platform.csv_platform.main()

Parse a CSV file containing probe sequences

A CSV file containing probe information is parsed and probes get added to the platform. This script is usually used together with other PLATFORM scripts

PARAMETERS:

param1 (string): The probe id field

param2 (string): The probe sequence

parsing_scripts.platform.gpr_platform module
parsing_scripts.platform.gpr_platform.main()

Parse a GPR file containing PLATFORM information and probe sequences

A GPR file is a TAB-delimited file with headers and complete platform information (descriptions and probe sequences)

PARAMETERS:

param1 (int): Number of lines to skip

param2 (string): The column header to parse out the original probe id field. If it is composed of more than one field, put all of them separated by a |, for example X|Y (actual probe ids will be concatenated with dots . in that case)

param3 (string): The column header of the probe sequence you want to parse out

param4 (string): DEPRECATED - The column header to parse out the DB ‘gene_map_content’ field; if multiple, separate them with a pipe | (actual probe ids will be concatenated with dots . in that case)

param5 (string): The column header to parse out the probe name field. If it is composed of more than one field, put all of them separated by a |, for example X|Y (actual probe ids will be concatenated with dots . in that case)

param6 (string): The column header to parse out the probe set name field

param7 (bool): Ensure that the original probe id in SAMPLE_OBJECT will be unique (defaults to False)

parsing_scripts.platform.ndf_platform module
parsing_scripts.platform.ndf_platform.main()

Parse a NDF file containing probe sequences

An NDF file is an ArrayExpress file that contains probe sequences. It has a header with X and Y positions for each probe, a SEQUENCE field and a PROBE_ID field. The combination X.Y is used to store the probe id and ensure that it is a unique name

PARAMETERS:
param1 (int): Number of lines to skip
parsing_scripts.platform.soft_platform module
parsing_scripts.platform.soft_platform.main()

Parse a SOFT file and extract PLATFORM information

Looks for accession number, platform name, platform type and platform description. If True is passed as parameter it will look for probe sequence information in the data table part of the file

PARAMETERS:
param1 (bool): Read the data table information (default False)
Module contents
parsing_scripts.sample package
Submodules
parsing_scripts.sample.cel_sample module
parsing_scripts.sample.cel_sample.main()

Parse a CEL (Affymetrix) file and extract SAMPLE raw data

The probe original id is given by X.Y

PARAMETERS:
None
parsing_scripts.sample.gpr_sample module
parsing_scripts.sample.gpr_sample.main()

Parse a GPR file and extract SAMPLE raw data

A GPR file is a TAB-delimited file with headers and complete sample raw data information

PARAMETERS:

param1 (string): The column header of the original probe id to parse out. If it is composed of more than one field, put all of them separated by a |, for example X|Y (actual probe ids will be concatenated with dots . in that case)

param2 (string): The column header of the data value you want to parse out

param3 (int): Number of lines to skip

param4 (int): The sample channel (optional)

parsing_scripts.sample.pair_sample module
parsing_scripts.sample.pair_sample.main()

Parse a PAIR file and extract SAMPLE raw data

A PAIR file is a TAB-delimited file with headers and complete sample raw data information. The probe id is given by X.Y to ensure uniqueness and the raw data value is taken from the PM column

PARAMETERS:
param1 (int): Number of lines to skip
parsing_scripts.sample.soft_sample module
parsing_scripts.sample.soft_sample.main()

Parse a SOFT file and extract SAMPLE description and optionally raw data

PARAMETERS:
param1 (string): The raw data value field, if empty it will be assigned automatically using the sample_column_identifier function
parsing_scripts.sample.trim_quantify module
parsing_scripts.sample.trim_quantify.main()

Trim a FASTQ file using Trimmomatic and quantify using KALLISTO

The result counts will be added to the SAMPLE OBJECT

PARAMETERS:
param1 (bool): True if this FASTQ file has a PAIRED file (forward or reverse), default False
Module contents
parsing_scripts.utils package
Submodules
parsing_scripts.utils.column_identifier module
parsing_scripts.utils.column_identifier.sample_column_identifier(query, header)

Tries to automatically identify the header column that contains the raw data given some query information (like the dye color)

Multi-channel arrays might have a different dye color on different samples (dye-swap) and thus it would be tedious to manually define it for each single sample. This function tries to do it for you and is typically invoked for the SOFT sample files.

PARAMETERS:

query (string): The query string is usually something that contains information about the color, e.g. cy3, red, green, etc.

header (list): The header is a list of strings from which to choose the one that matches the query
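
A minimal usage sketch (the header values are hypothetical and the exact matching logic depends on the implementation):

# Hypothetical usage inside a SOFT sample script: pick the raw data column
# that matches the dye of this channel.
header = ['ID_REF', 'Spot Mean Intensity (Cyanine5)', 'Spot Mean Intensity (Cyanine3)']
raw_data_column = sample_column_identifier('cy5', header)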

parsing_scripts.utils.rnaseq module
parsing_scripts.utils.rnaseq.create_fasta(file, compendium)

Create a FASTA file using the BIOLOGICAL FEATURE of the current Organism

PARAMETERS:

file (string): The output FASTA file name

compendium (string): The organism (nick) name

Module contents

Module contents

COMMAND>_ for developers

In order to add new features to COMMAND>_ you'll need to work within the whole framework. As a demonstration we will create a basic page to retrieve some data from the database and show them in a grid within COMMAND>_. So we will take care of:

  • create the ExtJS interface;
  • create the Python view;
  • create the permission to access the view;
  • make an AJAX call passing parameters;
  • perform a job on Celery to run in the background;
  • handle the WebSocket to show the results in a grid.

We will also see how to extend COMMAND>_ functionalities, such as how to add a new public database users can search on, how to add a new platform type and so on.

Note

For anything else related to the interface design please refer to the ExtJS documentation. To properly add new models and extend the data model, please refer to the Django documentation.

Add brand new feature in COMMAND>_

Create the ExtJS interface

COMMAND>_ is a single-page application, so everything you see runs within one HTML file and the Javascript code needed to display the interface is loaded and managed by the ExtJS framework. All ExtJS interface files (views) live within the directory command/static/command/js/ext-js/app/view. So let's create a test directory in there and, within that directory, create two files: Test.js and TestController.js.

Let’s fill these two files with some basic code like the following:

// Test.js

Ext.define('command.view.test.Test', {
  extend: 'Ext.Component',

  xtype: 'test',

  title: 'Test',

  requires: [
      'Ext.panel.Panel',
      'command.view.test.TestController'
  ],

  controller: 'test',

  store: null,

  alias: 'widget.test',

  itemId: 'test',

  reference: 'test',

  viewModel: {},

  html: 'TEST',

  listeners: {
      //
  },

  initComponent: function() {
      this.callParent();
  },

  destroy: function() {
      this.callParent();
  }
});
// TestController.js

Ext.define('command.view.test.TestController', {
  extend: 'Ext.app.ViewController',

  alias: 'controller.test'
});

Now you will need to run the command sencha app build from within the command/static/command/js/ext-js directory.

Note

To use the sencha app build command you will need to download and install Sencha CMD

Now you should be able to point your browser to http://localhost/#view/test and see that the Test panel has been correctly loaded as a tab within the main application panel. To make it reachable with a button and to add a small icon next to the tab name we should edit two files, Main.js (here) and Application.js (here).

// Main.js
// Add the ``Test`` menu button

},{
    text: 'Test',
    itemId: 'test_menu_item',
    iconCls: null,
    glyph: 'xf11b',
    listeners: {
        click: {
            fn: 'onAction',
            hash: 'view/test',
            glyph: 'xf11b',
            panel: 'test'
        }
    }
},{
    text: 'Options',
    ...
// Application.js
// Add the ``test`` glyph

version: null,

panel_glyph: {
    'test': 'xf11b',
    ...

You should see something like the following:

_images/screenshot_test.png

Create the Python View code

Now let’s create a grid, a basic double-click event and a link to a Python view. First of all we need to create the test.py file within the views directory (here). The basic view file should look something like this:

# test.py

import json
from django.http import HttpResponse
from django.views import View
from command.lib.utils.decorators import forward_exception_to_http


class TestView(View):

    def get(self, request, operation, *args, **kwargs):
        method = getattr(self, operation)
        return method(request, *args, **kwargs)

    def post(self, request, operation, *args, **kwargs):
        method = getattr(self, operation)
        return method(request, *args, **kwargs)

    @staticmethod
    @forward_exception_to_http
    def test(request, *args, **kwargs):
        return HttpResponse(json.dumps({'success': True}),
                            content_type="application/json")

The test function does nothing at the moment and is meant to respond to an Ajax call. We'll see that within the same TestView class we will put both the code to manage Ajax requests and the code to manage WebSocket requests. Before we add any business logic we need to tell COMMAND>_ that the ExtJS view test will make requests to the Python view TestView and that users need no specific privileges to do that (for the moment). So let's add one line in the consumer.py script (here):

# consumer.py

class Dispatcher:
   dispatcher = {
      ...
      ExportDataView: ['export_data'],
      TestView: ['test']
   }

Add a grid to the ExtJS interface

So far, so good. Let’s remove the HTML code from the Test.js file and let’s add a grid to show all the experiments for the selected compendium. The file will now look like this:

// Test.js

Ext.define('command.view.test.Test', {
   extend: 'command.Grid',

   xtype: 'test',

   title: 'Test',

   requires: [
     'Ext.panel.Panel',
     'command.view.test.TestController'
   ],

   controller: 'test',

   store: null,

   alias: 'widget.test',

   itemId: 'test',

   reference: 'test',

   viewModel: {},

   mixins: {
      getRequestObject: 'RequestMixin'
   },

   command_view: 'test',

   command_read_operation: 'test_read',

   listeners: {
     //
   },

   columns: [{
     text: 'Accession',
     flex: 2,
     sortable: true,
     dataIndex: 'experiment_access_id',
   }, {
     text: 'Experiment name',
     flex: 2,
     sortable: true,
     tdCls: 'command_tooltip',
     dataIndex: 'experiment_name'
   }, {
     text: 'Scientific paper',
     flex: 2,
     sortable: true,
     dataIndex: 'scientific_paper_ref'
   }, {
     text: 'Description',
     flex: 2,
     sortable: true,
     tdCls: 'command_tooltip',
     dataIndex: 'description'
   }],

   initComponent: function() {
     this.store = Ext.create('command.store.Experiments');
     this.callParent();
   },

   destroy: function() {
     this.callParent();
   }
});

Please note that:

  • we extend command.Grid instead of Ext.Component;
  • with command_view we tell COMMAND>_ which view to use;
  • with command_read_operation we declare the default read operation (i.e. the default Python function to be called);
  • in initComponent we declare the ExtJS store to use.

Create the Ajax call on double-click event

// Test.js

listeners: {
   itemdblclick: 'onTestDoubleClick'
},

Then we’ll need to implement onTestDoubleClick in TestController.js:

// TestController.js

onTestDoubleClick: function(dv, record, item, index, e) {
     var grid = dv.up('grid');
     var gridSelection = grid.getSelection();
     var request = grid.getRequestObject('test');
     request.values = JSON.stringify(gridSelection[0].data);
     Ext.Ajax.request({
         url: request.view + '/' + request.operation,
         params: request,
         success: function (response) {
             command.current.checkHttpResponse(response);
         },
         failure: function (response) {
              console.log('Server error', response);
         }
     });
 }

Manage asynchronous code using Celery and WebSocket

The request object is configured to automatically retrieve the view name (request.view) and is set to call the test function in the Python TestView.

# test.py

    @staticmethod
    @forward_exception_to_http
    def test(request, *args, **kwargs):
        values = json.loads(request.POST['values'])

        comp_id = request.POST['compendium_id']
        channel_name = request.session['channel_name']
        view = request.POST['view']
        operation = request.POST['operation']

        test.test_task.apply_async(
            (request.user.id, comp_id, values['id'], channel_name, view, operation)
        )

        return HttpResponse(json.dumps({'success': True}),
                            content_type="application/json")

With test.test_task.apply_async we are calling the test_task function from the Celery task file test.py (not to be confused with the Python view file that has the same name). We need to create this file and implement the functionality. So let's create a file named test.py in the command/command/lib/tasks directory (here). The file will look like this:

# test.py

from __future__ import absolute_import, unicode_literals
from time import sleep
import celery
from channels import Channel
from command.lib.utils.message import Message


class TestCallbackTask(celery.Task):
    def on_success(self, retval, task_id, args, kwargs):
        user_id, compendium_id, path, channel_name, view, operation = args
        channel = Channel(channel_name)
        message = Message(type='info', title='Hello world!',
                          message='Hi there!'
                          )
        message.send_to(channel)

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        pass


@celery.task(base=TestCallbackTask, bind=True)
def test_task(self, user_id, compendium_id, exp_id, channel_name, view, operation):
    sleep(1)

The test_task function simply waits for one second. When it's done, the on_success callback function gets called and retrieves the WebSocket channel name to send back a simple message. That message will be captured on the client side and a pop-up will appear. Before trying it out we need to inform Celery that there's an extra file to search for when calling a task. This is done in the Django settings file, here.

# settings.py

CELERY_IMPORTS = (
   'command.lib.tasks.experiment_public',
   'command.lib.tasks.experiment_local',
   'command.lib.tasks.uncompress_file',
   'command.lib.tasks.run_file_assignment_script',
   'command.lib.tasks.run_parsing_script',
   'command.lib.tasks.parse_bio_feature_file',
   'command.lib.tasks.run_platform_mapper',
   'command.lib.tasks.import_experiment',
   'command.lib.tasks.import_platform_mapping',
   'command.lib.tasks.export_data',
   'command.lib.tasks.test',
)

You should now be able to double-click on a grid value and see something like this.

_images/screenshot_test_3.png

Add new public database manager

At the moment of writing, COMMAND>_ is able to search on GEO, ArrayExpress and SRA.

_images/screenshot_test_4.png

To add a new database to this list, you will need to add a line to a database table and extend one class. In the command_datasource database table you should add the source name and the class that handles it.

_images/screenshot_test_5.png

The class should be defined by extending the PublicDatabase class that is defined here. This is an abstract class and to extend it you will need to implement three methods (a minimal skeleton is sketched below):

  • search: it performs the actual search on the public database (through a REST call or FTP for example) and creates one ExperimentSearchResult, to be stored in the database, for each retrieved entry;
  • download_experiment_files: it is responsible for getting all the data files related to one single ExperimentSearchResult and saving them in the output directory;
  • create_experiment_structure: starting from the information in the downloaded files, this method should create the experiment, platform and sample structures and save them using the Experiment, Platform and Sample Django models.
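
A minimal skeleton of such a handler could look like the following (the class name, import and method signatures are indicative only; check the actual PublicDatabase class for the exact interface):

# my_database.py - indicative skeleton for a new public database handler.
# PublicDatabase is the abstract class from the COMMAND>_ code base; the exact
# method signatures should be taken from its definition.

class MyDatabase(PublicDatabase):

    def search(self, term):
        # query the remote database (e.g. via REST or FTP) and create one
        # ExperimentSearchResult per retrieved entry
        raise NotImplementedError

    def download_experiment_files(self, experiment_result, output_directory):
        # fetch all data files related to a single ExperimentSearchResult
        # and save them into output_directory
        raise NotImplementedError

    def create_experiment_structure(self, experiment_result):
        # build the Experiment, Platform and Sample Django model objects
        # from the downloaded files and save them
        raise NotImplementedError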

Add new compendium type

This is by far the easiest thing to do since it's just a matter of adding one tuple to the command DB. The table to be modified is the command_compendiumtype table. At the moment the only compendium type defined is the gene expression one. The fields are name, description and the biological feature name, so respectively gene_expression, Gene expression compendium and gene.

_images/screenshot_db_compendium.png

Add new biological feature file importer

All the classes related to importing biological features are located here. The first thing to do is to inform the dispatcher in the importers.py file about the classes responsible for managing the different file types. For example, genes will be imported using FASTA files. The second step is to actually implement the class by extending the BaseImporter class. The newly defined class will need to implement the parse method and redefine the FILE_TYPE_NAME variable.

# fasta_file_importer.py

from Bio import SeqIO
from django.db import transaction
# BaseImporter, BioFeature, BioFeatureFields and BioFeatureValues are part of
# the COMMAND>_ code base (see the importer classes linked above)


class FastaFileImporter(BaseImporter):
    FILE_TYPE_NAME = 'FASTA'

    def parse(self, filename):
        sequence_field = BioFeatureFields.objects.using(self.compendium).get(name='sequence')
        with transaction.atomic(using=self.compendium):
            with open(filename, 'rU') as handle:
                for record in SeqIO.parse(handle, 'fasta'):
                    # each FASTA record becomes a new biological feature (gene)
                    gene = BioFeature()
                    gene.name = record.id
                    gene.description = record.description
                    gene.save(using=self.compendium)
                    # the sequence is stored as a biological feature value
                    bf_value = BioFeatureValues()
                    bf_value.bio_feature = gene
                    bf_value.bio_feature_field = sequence_field
                    bf_value.value = str(record.seq)
                    bf_value.save(using=self.compendium)
_images/screenshot_fasta_importer.png

Add new platform type

To add a new platform type there are several steps to perform, and they mostly depend on the kind of platform that is going to be added.

Database entry

To add a new platform type for a single compendium (organism) you will need to add a tuple with name, description, bio feature reporter name and the compendium type ID, for example: microarray, MicroArray, probe and 1, to the command_platformtype table. If you want every new compendium you are going to create to have such a new platform, you will need to add the same tuple to the command_platformtypeadmin table in the command DB.

_images/screenshot_plt_type.png

Reporters ExtJS GUI

The next step is to inform the GUI how to behave when the user wants to see the biological feature reporters associated with the new platform. For example, in the case of microarrays the biological feature reporters are the probes. The file to modify is PlatformController.js (defined here). onViewBioFeatureReporter is the method to modify, adding a new case for the new platform. For example, in the case of RNA-seq we simply display a message saying there are no associated biological feature reporters, since the gene expression measurement in this case is directly given by read counts. For microarrays instead we have probes, and thus we open a new window to show the probes associated with this platform, the window_bio_feature_reporter window.

// PlatformsController.js

onViewBioFeatureReporter: function (me) {
    var selection = me.up('grid').getSelectionModel().getSelection()[0].data;
    var comp = JSON.parse(localStorage.getItem("current_compendium"));
    if (selection.platform_type) {
        switch (selection.platform_type.name) {
            case 'rnaseq':
                Ext.MessageBox.show({
                    title: 'RNA-seq platform',
                    msg: 'For RNA-seq platform ' + selection.platform_access_id +  ', ' +  comp.compendium_type.bio_feature_name + ' is/are directly measured',
                    buttons: Ext.MessageBox.OK,
                    icon: Ext.MessageBox.INFO,
                    fn: function (a) {
                    }
                });
                break
            case 'microarray':
                var win = Ext.create({
                    xtype: 'window_bio_feature_reporter',
                    title: 'Microarray platform ' + selection.platform_access_id + ': ' +
                        comp.compendium_type.bio_feature_name + ' feature reporters (' + selection.platform_type.bio_feature_reporter_name + ')',
                    platform: selection
                });
                break
        }
    }
}

Add new platform mapper

When a platform has biological feature reporters associated with it, these must be mapped to the biological features. In the case of a gene expression compendium the biological features are genes. So, to give a concrete example, we will need to associate microarray probes to genes. This step is very platform-dependent and so a lot of freedom is left to the developer to design the GUI. There are just a few things to keep in mind in order to have everything working correctly within the COMMAND>_ framework.

Mapper ExtJS GUI

The first thing is to inform the GUI how to behave when the user wants to map this platform's reporters to the biological features. The file to modify is again PlatformController.js (defined here), but this time we are going to modify the onMapPlatformToBioFeature method, adding a new case for the new platform. For microarrays we defined a new window, window_map_microarray_platform, here. Again, in this case the developer is left completely free to design it as they want.

// PlatformsController.js

onMapPlatformToBioFeature: function (me) {
    var selection = me.up('grid').getSelectionModel().getSelection()[0].data;
    var comp = JSON.parse(localStorage.getItem("current_compendium"));
    if (selection.platform_type) {
        switch (selection.platform_type.name) {
            case 'rnaseq':
                Ext.MessageBox.show({
                    title: 'RNA-seq platform',
                    msg: 'RNA-seq platform ' + selection.platform_access_id +  ' is automatically mapped to ' + comp.compendium_type.bio_feature_name,
                    buttons: Ext.MessageBox.OK,
                    icon: Ext.MessageBox.INFO,
                    fn: function (a) {
                    }
                });
                break
            case 'microarray':
                command.current.createWin({
                    xtype: 'window_map_microarray_platform',
                    title: 'Map microarray platform ' + selection.platform_access_id + ' to ' + comp.compendium_type.bio_feature_name,
                    platform: selection
                });
                break
        }
    }
}

Mapper Django View

The associated Django View is defined in the platform.py view file here, and for microarrays this is the MicroarrayPlatformView class. This is a pretty standard view, as described previously.

Mapper code

The actual code is stored in a class that extends the BaseMapper (placeholder) class. For microarrays this class is MicroarrayMapper and it is located here. The last step is to inform the mapper dispatcher which class to invoke, and this is done in the mappers.py file located here.

# mappers.py

from command.lib.coll.platform.microarray_mapper import MicroarrayMapper

platform_mapper = {
    'microarray': MicroarrayMapper
}

Contribute & Support

Use the GitHub Pull Request and/or Issue Tracker.

Author

To send me an e-mail about anything else related to COMMAND>_ write to email

License

The project is licensed under the GPLv3 license.

How to cite

If you find COMMAND>_ useful for your work please cite

Moretto, M., Sonego, P., Villaseñor-Altamirano, A. B., & Engelen, K. (2019). First step toward gene expression data integration: transcriptomic data acquisition with COMMAND>_. BMC bioinformatics, 20(1), 54.

https://doi.org/10.1186/s12859-019-2643-6