spectra-cluster-py - Analysing clustering results

Welcome

The spectra-cluster-py project is a collection of tools and APIs that help to analyse and work with MS/MS spectrum clustering results in the .clustering format.

The .clustering format

The .clustering format is currently used by the spectra-cluster algorithm (and API) and is the output format of the derived tools.

Up-to-date documentation of the .clustering format can be found at the Java API clustering-file-reader project page.

Tools

The spectra-cluster-py project contains a set of end-user ready tools to analyse MS/MS clustering results in the .clustering format.

The id_transferer_cli transfers identification data to unidentified spectra. This can be used to improve the accuracy of label-free quantitation data.

The clustering_stats tool creates simple tab-delimited files with basic QC measurements of the clustering results. This is only possible if identification data is present in the .clustering file, as these data are used as the gold standard (see id_transferer_cli).

The cluster_features_cli creates a matrix with the input MGF file names as column headers and the clusters as rows. Each cell contains the number of spectra per file and cluster. This can be used, for example, to run a principal component analysis of the input files based on the clusters.

The protein_annotator can map peptides in a text file to proteins from a fasta file. Additionally, basic protein inference can be performed.

The cluster_result_comparator can be used to compare two clustering results (in the .clustering format). The comparison is performed by creating a network representation where clusters are nodes and edges are created based on shared spectra. If Cytoscape is running before the script is launched, the network is automatically displayed in Cytoscape.

The complete list of tools can be found here.

In the python package the source code of these tools is at spectra_cluster.ui.

Python API

This collection of classes is intended to help you develop your own scripts to analyse MS/MS clustering results in the .clustering format.

You can find the complete API documentation here.

Tools

The spectra-cluster-py project provides a set of end-user tools to process .clustering files.

consensus_spectrum_exporter

consensus_spectrum_exporter.py

This tool exports the consensus spectra of a clustering result file into the specified format.

Usage:
consensus_spectrum_exporter.py --input=<results.clustering> --output=<spectra.mgf> [--format=<MGF>]
[--min_size=<size>] [--max_size=<size>] [--min_ratio=<ratio>] [--max_ratio=<ratio>] [--min_identified=<spectra>] [--max_identified=<spectra>]
consensus_spectrum_exporter.py --cluster_ids=<ids.txt> --input=<results.clustering> --output=<spectra.mgf>
[--format=<MGF>]

consensus_spectrum_exporter.py (--help | --version)

Options:
-i, --input=<clustering file>
 Path to the .clustering result file to process.
-o, --output=<features.txt>
 Path to the output file that should be created. The output will be formatted as a tab-delimited text file.
-f, --format=<MGF>
 The output format to use. Currently only “MGF” is supported [default: MGF]
--cluster_ids=<ids.txt>
 If this parameter is set, the cluster ids are read from the specified file (one id per line) and only these clusters will be exported. All other filtering parameters are ignored if this parameter is set.
--min_size=<size>
 The minimum size of a cluster to be reported.
--max_size=<size>
 The maximum size of a cluster to be reported.
--min_ratio=<ratio>
 The minimum ratio a cluster must have to be reported.
--max_ratio=<ratio>
 The maximum ratio a cluster must have to be reported.
--min_identified=<spectra>
 May specify the minimum number of identified spectra a cluster must have.
--max_identified=<spectra>
 May specify the maximum number of identified spectra a cluster must have.
-h, --help Print this help message.
-v, --version Print the current version.

cluster_parameter_extractor

Extracts basic parameters about the clusters found in a .clustering result file and writes these parameters to a tab-delimited file.

Usage:
cluster_parameter_extractor.py --input=<results.clustering> --output=<parameters.txt>
[--synthetic_peptides]

cluster_parameter_extractor.py (--help | --version)

Options:
-i, --input=<clustering file>
 Path to the .clustering result file to process.
-o, --output=<parameters.txt>
 Path to the output file that should be created.
--synthetic_peptides
 If this option is specified, all spectra from the dataset on synthetic peptides (PXD004732) will be analysed separately.
-h, --help Print this help message.
-v, --version Print the current version.

cluster_spectra_extractor

This tool extracts the spectra of a specified cluster from the original source peak list files and writes all of a cluster’s spectra to a single MGF file.

Usage:
cluster_spectra_extractor --output_directory=</path/to/results> --clustering_file=<result.clustering>
[--add_consensus_spectrum] --peaklist_directory=</path/to/dir>... <cluster_id>...

cluster_spectra_extractor --build_index <mgf_files>...
cluster_spectra_extractor (--help | --version)

Options:
-o, --output_directory=</path/to/results>
 Path to the directory where the MGF files will be created. The files will have the cluster’s id as a name.
-c, --clustering_file=<result.clustering>
 Path to the .clustering result file.
-p, --peaklist_directory=</path/to/dir>
 Path to a directory holding the original MGF files. Multiple directories can be specified by specifying this parameter multiple times.
-i, --build_index
 If set, the passed MGF files are indexed.
--add_consensus_spectrum
 If set, the cluster’s consensus spectrum is written to the MGF file as the first spectrum.
-h, --help Displays this help.
-v, --version Displays the tool’s version.

clustering_stats

Extracts basic statistics (i.e. number of clusters, incorrectly clustered spectra) from .clustering files. This script only creates meaningful results if the .clustering file contains identification data, which is used to evaluate correctly and incorrectly clustered spectra.

Usage:
clustering_stats --output <stats.tsv> --min_size <3> CLUSTERING_FILE...
clustering_stats (--help | --usage)

Options:
-o, --output <stats.tsv>
 Name of the result file to be created.
-s, --min_size <3>
 The minimum size a cluster must have to be evaluated [default: 3]
-h, --help Show this help message.
--usage Show usage information

mgf_search_result_annotator

The mgf_search_result_annotator embeds identification data in MGF files to be processed by the spectra_cluster algorithm tool suite.

This tool adds search results to an MGF file by adding the identified peptide sequence as the SEQ= field to MGF files. This identification data is picked up by the spectra-cluster tools and added to the .clustering output files. Modification data is currently omitted (this is not the case in the internal PRIDE Cluster pipeline).

The spectra-cluster tools (i.e. the spectra-cluster-cli tool) expect identification data to be embedded in the processed MGF files. Even though this method is unorthodox, it significantly simplifies the development of clustering tools, as these do not have to worry about the search engine used or the search result formats. Additionally, when building the PRIDE Cluster resource we have to rely on this technique since the identification and spectrum data are exported from the PRIDE Archive database.
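The embedding described above can be illustrated with a minimal sketch. It is not the tool's implementation: the helper name `add_seq_fields`, the matching of spectra by their TITLE line, and the placement of the SEQ= line directly after TITLE are all assumptions made for illustration.

```python
def add_seq_fields(mgf_lines, title_to_sequence):
    """Return a copy of the MGF lines with SEQ= fields added.

    mgf_lines: list of lines of an MGF file (without line breaks).
    title_to_sequence: hypothetical mapping from spectrum title to the
    identified peptide sequence (e.g. taken from filtered search results).
    """
    annotated = []
    for line in mgf_lines:
        annotated.append(line)
        if line.startswith("TITLE="):
            sequence = title_to_sequence.get(line[len("TITLE="):])
            # Only identified spectra receive a SEQ= field.
            if sequence is not None:
                annotated.append("SEQ=" + sequence)
    return annotated


mgf = [
    "BEGIN IONS", "TITLE=spec_1", "PEPMASS=500.25", "100.1 10.0", "END IONS",
    "BEGIN IONS", "TITLE=spec_2", "PEPMASS=600.5", "END IONS",
]
result = add_seq_fields(mgf, {"spec_1": "PEPTIDE"})
print("\n".join(result))
```

Unidentified spectra (here spec_2) are passed through unchanged.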

Usage:
mgf_search_result_annotator.py --input=<spectra.mgf> --search=<search_result.mzid> --output=<annotated_spectra.mgf>
[--format=<MSGF+>] [--fdr=<0.01>] [--decoy_string=<REVERSED>]

mgf_search_result_annotator.py (--help | --version)

Options:
-i, --input=<spectra.mgf>
 The original MGF file to use as input.
-s, --search=<search_result.mzid>
 The path to the search result. Note: The search must have been performed on the input MGF file directly. Otherwise, the matching between identification data and spectra may go wrong.
-o, --output=<annotated_spectra.mgf>
 Path where the annotated MGF file should be written.
-f, --format <MSGF+>
 The format of the search results. Possible options are “MSGF+”, “MSGF_ident” (MSGF+ mzIdentML files), “MSAmanda”, “Scaffold”, “XTandem”. [default: “MSGF+”]
-d, --fdr=<0.01>
 Define the FDR by which the input search results are filtered. If the FDR is set to ‘2’ for Scaffold output, the original cut-off is used. [default: 0.01]
--decoy_string=<REVERSED>
 The string to use to identify decoy proteins. [default: REVERSED]
-h, --help Print this help message.
-v, --version Print the current version.

id_transferer_cli

split_moff_file

This small script is necessary to use the moFF output files created by the id_transferer_cli tool together with the moFF toolchain. moFF expects one result file per input file. The id_transferer_cli tool, on the other hand, creates one result file per clustering file.

The split_moff_file tool can be used to create one result file per input file based on a moFF-formatted result file (use the option --moff_compatible) from the id_transferer_cli tool.

split_moff_file

Split a MoFF quantification output file based on the MGF input file that the various PSMs originate from.

Usage:
split_moff_file.py --input=<moff_result.txt> --out_dir=</my/dir>
split_moff_file.py (--help | --version)

Options:
-i, --input=<moff_result.txt>
 The path to the moff result file. New files will be generated by appending “moff.txt” to the MGF filename.
-o, --out_dir=</my/dir>
 Output directory to save the newly created files in.
-h, --help Print this help message.
-v, --version Print the current version.

cluster_features_cli

Command line interface to the spectra-cluster cluster as features tool. This tool extracts the clusters and samples found in the dataset and exports a table where each sample is represented as a column and each cluster as a row. The cells then contain the number of spectra observed from the given sample in the given cluster.

Usage:
cluster_features_cli.py --input=<results.clustering> --output=<features.txt>
[--min_size=<size>] [--min_ratio=<ratio>] [--min_identified=<spectra>]

cluster_features_cli.py (--help | --version)

Options:
-i, --input=<clustering file>
 Path to the .clustering result file to process.
-o, --output=<features.txt>
 Path to the output file that should be created. The output will be formatted as a tab-delimited text file.
--min_size=<size>
 The minimum size of a cluster to be reported.
--min_ratio=<ratio>
 The minimum ratio a cluster must have to be reported.
--min_identified=<spectra>
 May specify the minimum number of identified spectra a cluster must have.
-h, --help Print this help message.
-v, --version Print the current version.
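
The table produced by this tool can be pictured with a small sketch. It does not reproduce the tool's code; the function name `build_feature_table` and the input shape (a dict from cluster id to one sample name per spectrum) are assumptions chosen to keep the example self-contained.

```python
from collections import Counter


def build_feature_table(cluster_samples):
    """Build a cluster-by-sample table of spectrum counts.

    cluster_samples: hypothetical dict mapping a cluster id to a list with
    one sample name per spectrum in that cluster.
    Returns the sorted sample names (column headers) and one row per cluster
    in the form [cluster_id, count_sample_1, count_sample_2, ...].
    """
    samples = sorted({s for names in cluster_samples.values() for s in names})
    rows = []
    for cluster_id, names in cluster_samples.items():
        counts = Counter(names)
        rows.append([cluster_id] + [counts.get(s, 0) for s in samples])
    return samples, rows


example = {
    "c1": ["run_a", "run_a", "run_b"],
    "c2": ["run_b"],
}
samples, rows = build_feature_table(example)

# Write the table as tab-delimited text, samples as columns.
print("\t".join(["cluster_id"] + samples))
for row in rows:
    print("\t".join(str(value) for value in row))
```

A matrix of this shape can be passed to a principal component analysis with samples as observations and clusters as features.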

protein_annotator

This tool adds a protein accession column to a text file based on the present peptides.

The column containing the peptide string must be defined. Then, all proteins that match the given peptide are written to a specified column.

Usage:
protein_annotator.py --input=<input.tsv> --output=<extended_file.tsv> --fasta=<fasta_file.fasta>
[--peptide_column=<column_name>] [--protein_column=<column_name>] [--protein_separator=<separator>] [--column_separator=<separator>] [--ignore_il]

protein_annotator.py (--help | --version)

Options:
-i, --input=<input.tsv>
 Path to the input file.
-o, --output=<extended_file.tsv>
 Path to the output file that should be created.
-f, --fasta=<fasta_file.fasta>
 Fasta file to match the peptides to.
--peptide_column=<column_name>
 Column name of the peptide column [default: sequence]
--protein_column=<column_name>
 Column name of the newly added protein column [default: protein]
--protein_separator=<separator>
 Separator to separate multiple protein entries [default: ;]
--column_separator=<separator>
 Separator to separate columns in the file [default: TAB]
--ignore_il If set, I and L are treated as synonymous.
-h, --help Print this help message.
-v, --version Print the current version.
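
The peptide-to-protein matching and the effect of --ignore_il can be sketched as follows. This is a simplified illustration using plain substring matching, not the tool's implementation; the function name `annotate_peptides` and the dict-based inputs are assumptions.

```python
def annotate_peptides(peptides, proteins, ignore_il=False, separator=";"):
    """Map each peptide to the accessions of all proteins containing it.

    peptides: iterable of peptide sequences.
    proteins: hypothetical dict mapping protein accession to sequence.
    ignore_il: if True, I and L are treated as the same residue.
    """
    def normalise(sequence):
        return sequence.replace("I", "L") if ignore_il else sequence

    normalised = {acc: normalise(seq) for acc, seq in proteins.items()}
    annotation = {}
    for peptide in peptides:
        query = normalise(peptide)
        hits = [acc for acc, seq in normalised.items() if query in seq]
        annotation[peptide] = separator.join(hits)
    return annotation


proteins = {"P1": "MKTLLAGK", "P2": "MKTILAGK"}
strict = annotate_peptides(["TLLAG"], proteins)
relaxed = annotate_peptides(["TLLAG"], proteins, ignore_il=True)
print(strict, relaxed)
```

With I/L treated as synonymous, the peptide matches both proteins, since they differ only at one I/L position.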

unique_fasta_extractor

This tool simply removes all duplicate protein entries (based on the sequence) from a given FASTA file.

Usage:
unique_fasta_extractor.py --input=<original.fasta> --output=<unique.fasta>

Options:
-i, --input=<original.fasta>
 Path to the FASTA file to process.
-o, --output=<unique.fasta>
 Path to the newly created unique FASTA file
-h, --help Print this help message.
-v, --version Print the current version.
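
Sequence-based de-duplication of a FASTA file can be sketched in a few lines. The helpers `parse_fasta` and `unique_by_sequence` are hypothetical names, and keeping the first entry of each duplicated sequence is an assumption; the tool itself may select a different entry.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def unique_by_sequence(entries):
    """Yield only the first entry for every distinct sequence."""
    seen = set()
    for header, sequence in entries:
        if sequence not in seen:
            seen.add(sequence)
            yield header, sequence


fasta = """>P1 first
MKTAYIAK
QRQISFVK
>P2 duplicate of P1
MKTAYIAKQRQISFVK
>P3 other
MSEQ
""".splitlines()

unique = list(unique_by_sequence(parse_fasta(fasta)))
for header, sequence in unique:
    print(">" + header)
    print(sequence)
```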

cluster_result_comparator

This tool compares two clustering results with each other.

The core comparison is done by creating a network representation of the clustering results. Every cluster in each of the files is represented as a node. Spectra present in both clusters create the edges.

The network is rendered in Cytoscape. For this to work Cytoscape must be opened before launching the script. Additionally, the resulting network is stored as a GraphML file.
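
The network construction can be sketched without Cytoscape or GraphML. The function `shared_spectra_edges` is a hypothetical name, each clustering result is represented as a plain dict from cluster id to a set of spectrum ids, and the sketch assumes that every spectrum belongs to exactly one cluster per result.

```python
from collections import defaultdict


def shared_spectra_edges(result_1, result_2):
    """Return {(cluster_in_1, cluster_in_2): number of shared spectra}."""
    # Index the second result by spectrum so each spectrum is looked up once.
    spectrum_to_cluster = {}
    for cluster_id, spectra in result_2.items():
        for spectrum in spectra:
            spectrum_to_cluster[spectrum] = cluster_id

    edges = defaultdict(int)
    for cluster_id, spectra in result_1.items():
        for spectrum in spectra:
            other = spectrum_to_cluster.get(spectrum)
            if other is not None:
                edges[(cluster_id, other)] += 1
    return dict(edges)


first = {"A1": {"s1", "s2", "s3"}, "A2": {"s4"}}
second = {"B1": {"s1", "s2"}, "B2": {"s3", "s4"}}
edges = shared_spectra_edges(first, second)
for (node_a, node_b), weight in sorted(edges.items()):
    print(node_a, node_b, weight)
```

Each key is an edge between a node from the first result and a node from the second; the value can serve as the edge weight.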

Usage:
cluster_result_comparator.py --output <network.xml> <result1.clustering> <result2.clustering>

Options:
-o, --output=<network.xml>
 Path to the network file (GraphML format) to be created.
-h, --help Print this help message.
-v, --version Print the current version.

API Documentation

The spectra-cluster-py project provides a set of APIs that support the development of your own scripts to work with MS/MS clustering results in the .clustering format.

Core

The core of the API is made up of a set of common objects that represent the clustering results and the Clustering Parser used to parse the .clustering files.

Analysers

The functionality of all end-user tools is represented through the analysers. Each analyser implements the AbstractAnalyser class.

Exporter

The exporter classes implement the AbstractAnalyser class but convert the clusters’ consensus spectra into different file formats.

API Class List

Objects

These are the common objects used to represent clustering data throughout all classes.

class spectra_cluster.objects.Cluster(cluster_id, precursor_mz, consensus_mz, consensus_intens, spectra, ignore_duplicated=True)

Represents a cluster in a .clustering output file.

__init__(cluster_id, precursor_mz, consensus_mz, consensus_intens, spectra, ignore_duplicated=True)

Creates a new cluster object

Parameters:
  • cluster_id – The cluster’s id
  • precursor_mz – The cluster’s average precursor m/z
  • consensus_mz – A list of doubles holding the consensus spectrum’s m/z values
  • consensus_intens – A list of doubles holding the consensus spectrum’s intensity values
  • spectra – A set of spectra associated with the cluster
  • ignore_duplicated – The constructor automatically removes duplicated spectra from the list of clustered spectra. If duplicated spectra are found, an Exception is raised. Setting this parameter to true does not prevent the filtering, but prevents the exception from being raised.
static calculate_sequence_counts(spectra, ignore_i_l=False)

Calculates the sequence counts based on the passed spectra. PTMs are ignored for this assessment

Parameters:
  • spectra – The spectra to derive the sequence counts from.
  • ignore_i_l – If set, I and L are treated as equivalent. In this case all I are replaced by L, and the sequences in the returned map may not correspond to the originally identified sequences.
Returns:

A dict with a sequence as key and the number of occurrences as value.
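
The counting described here can be sketched as a stand-alone function. This is not the library's code: `count_sequences` is a hypothetical name, it takes plain sequence strings instead of spectrum objects, and stripping all non-letter characters is an assumed way of ignoring modification annotations.

```python
import re
from collections import Counter


def count_sequences(sequences, ignore_i_l=False):
    """Count how often each peptide sequence occurs.

    Non-letter characters are removed and sequences are upper-cased first.
    If ignore_i_l is True, every I is replaced by L before counting.
    """
    counts = Counter()
    for sequence in sequences:
        clean = re.sub(r"[^A-Za-z]", "", sequence).upper()
        if ignore_i_l:
            clean = clean.replace("I", "L")
        counts[clean] += 1
    return dict(counts)


observed = ["PEPTIDE", "PEPTLDE", "PEPTIDE"]
strict_counts = count_sequences(observed)
relaxed_counts = count_sequences(observed, ignore_i_l=True)
print(strict_counts, relaxed_counts)
```

Note that with ignore_i_l enabled the key PEPTLDE stands for both original sequences, which is why the returned keys may differ from the identified sequences.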

get_spectra()

Returns the stored spectra in a tuple. These objects should not be changed. Otherwise, the cluster’s statistics may no longer be accurate.

Returns:A tuple containing the cluster’s spectra
set_spectra(new_spectra)

Updates the cluster’s stored spectra

Parameters:new_spectra – A list of PSM objects.
class spectra_cluster.objects.PSM(sequence, ptms)

Defines a peptide-spectrum-match

__init__(sequence, ptms)

Creates a new PSM object.

Parameters:
  • sequence – The sequence associated with the PSM.
  • ptms – A set of PTMs

class spectra_cluster.objects.PTM(position, accession)

Defines a post-translational modification within a peptide

Variables:
  • position – The PTM’s position within the peptide string
  • accession – The PTM’s accession in UNIMOD (if starting with “MOD:”). This may also represent a PSI entry in the format [PSI-MS, MS:1001524, fragment neutral loss, 63.998283]
__init__(position, accession)

Creates a new PTM object

Parameters:
  • position – 1-based position within the peptide (0 for terminus)
  • accession – MOD accession of the modification.

class spectra_cluster.objects.Spectrum(title, precursor_mz, charge, taxids, psms, similarity_score=0, json_properties=None)

A spectrum reference.

__init__(title, precursor_mz, charge, taxids, psms, similarity_score=0, json_properties=None)

Creates a new Spectrum reference.

Parameters:
  • title – The spectrum’s title.
  • precursor_mz – Measured precursor m/z
  • charge – Charge state
  • taxids – Set of taxids of the experiments in which the spectrum was observed
  • psms – A set of psms associated with the spectrum. If None is passed, an empty set is created.
  • similarity_score – The similarity of this spectrum with the cluster’s consensus spectrum.
  • json_properties – Additional properties of the spectrum encoded as a JSON string.

get_clean_sequence_psms()

Returns all PSMs with all special characters removed from the sequences.

Returns:A tuple of PSMs
get_clean_sequences()

Returns the identified sequences without any additional characters and in upper case only.

Returns:Identified sequences
get_filename()

The original filename can optionally be encoded in the title string. If present, this filename is returned, otherwise None.

Returns:Original filename or None if not present
get_id()

The spectrum’s id can optionally be encoded in the title string. If present this id is returned, otherwise None.

Returns:Original spectrum id or None if not present
get_mass()

Calculates the molecular mass based on the precursor_mz and the charge.

Returns:The molecular mass
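
As an illustration, the usual relation between precursor m/z, charge and neutral mass is sketched below. The function name `neutral_mass` and the proton mass constant are assumptions; the library may use a different constant or rounding.

```python
PROTON_MASS = 1.007276  # Da; assumed constant, not taken from the library


def neutral_mass(precursor_mz, charge):
    """Neutral mass of a positively charged ion: (m/z - proton mass) * z."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (precursor_mz - PROTON_MASS) * charge


mass = neutral_mass(500.0, 2)
print(round(mass, 6))
```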
get_property(key)

Get the property with the defined key.

Parameters:key – The property’s name.
Returns:The property’s value or None if it is not defined
get_title()

Optionally, the spectrum’s filename and id can be encoded in the title string. If this is the case, the original title is extracted from the string. If no fields were encoded, the whole title string is returned. Therefore, this function should always be used if the reader expects to access the original spectrum’s title.

Returns:The original spectrum’s title
is_identified()

Checks whether the spectrum was identified.

Returns:boolean

Clustering Parser

Usage

The clustering parser provides a simple python iterator:

import spectra_cluster.clustering_parser as clustering_parser

clustering_file = "test.clustering"
parser = clustering_parser.ClusteringParser(clustering_file)

for cluster in parser:
     # do something with the cluster
     pass
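
A typical loop body is sketched below. To keep the example runnable without the package or a .clustering file, it uses two hypothetical stand-in classes (FakeSpectrum, FakeCluster) that only mimic the documented get_spectra() and is_identified() methods; with the real package, the iterable passed to `summarise` would be a ClusteringParser instance.

```python
from collections import Counter


def summarise(clusters):
    """Count clusters per size and the total number of identified spectra."""
    sizes = Counter()
    n_identified = 0
    for cluster in clusters:
        spectra = cluster.get_spectra()
        sizes[len(spectra)] += 1
        n_identified += sum(1 for s in spectra if s.is_identified())
    return sizes, n_identified


class FakeSpectrum:
    """Stand-in for a spectrum object (illustration only)."""
    def __init__(self, identified):
        self._identified = identified

    def is_identified(self):
        return self._identified


class FakeCluster:
    """Stand-in for a cluster object (illustration only)."""
    def __init__(self, spectra):
        self._spectra = tuple(spectra)

    def get_spectra(self):
        return self._spectra


clusters = [
    FakeCluster([FakeSpectrum(True), FakeSpectrum(False)]),
    FakeCluster([FakeSpectrum(True)]),
    FakeCluster([FakeSpectrum(False), FakeSpectrum(False)]),
]
sizes, n_identified = summarise(clusters)
print(dict(sizes), n_identified)
```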
Class Definition
class spectra_cluster.clustering_parser.ClusteringParser(clustering_file)

Parses .clustering output files created by the spectra-cluster applications.

__init__(clustering_file)

Processes the passed .clustering file

Parameters:clustering_file – Path to the file to process

Abstract Analyser

Usage

This class provides some common features often required in analysers. Currently, the main feature is the filtering of clusters based on size, ratio, identified spectra, etc.

If you develop an analyser simply create a child of this class:

from spectra_cluster.analyser.common import AbstractAnalyser

class MyAnalyser(AbstractAnalyser):
    """
        This Analyser counts the total number of
        clusters
    """
    def __init__(self):
        # call the AbstractAnalyser's init method
        # to set all filtering variables to their
        # default values.
        super().__init__()

        # initialise your class's instance
        # variables
        self.number_of_clusters = 0

    def process_cluster(self, cluster):
        # This function must be implemented by all
        # analysers

        # first, check whether this cluster should
        # be processed at all
        if self._ignore_cluster(cluster):
            return

        # count the cluster
        self.number_of_clusters += 1
Class Definition
class spectra_cluster.analyser.common.AbstractAnalyser

Base class for all analysers.

This class mainly provides helper functions for filtering clusters based on size, ratio, identified spectra, etc.

Additionally, every child class must implement the process_cluster(Cluster) function.

Its init function sets the following member variables so that all clusters are accepted.

Variables:
  • min_size – Minimal size of the cluster
  • max_size – Maximum size of the cluster
  • min_ratio – Minimum I/L agnostic ratio
  • max_ratio – Maximum I/L agnostic ratio
  • min_identified_spectra – Minimal identified spectra
  • max_identified_spectra – Maximum identified spectra
  • min_unidentified_spectra – Minimal unidentified spectra
  • max_unidentified_spectra – Maximum unidentified spectra
__init__()

Initialises the default parameters to filter clusters.

_ignore_cluster(cluster)

Tests whether the passed cluster should be ignored

This function assesses the class’ instance variables (min size, max size, etc.) to test whether the passed cluster should be ignored for further processing.

Parameters:cluster – Cluster to test.
Returns:Boolean indicating whether the cluster should be ignored.
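
The kind of check performed here can be sketched as follows. This is not the library's implementation: `FilterSettings` and `ignore_cluster` are hypothetical names, only the size and identified-spectra limits are covered, and the defaults are assumed values chosen so that every cluster is accepted.

```python
import math
from dataclasses import dataclass


@dataclass
class FilterSettings:
    """Hypothetical container mirroring some of the documented limits."""
    min_size: float = 0
    max_size: float = math.inf
    min_identified_spectra: float = 0
    max_identified_spectra: float = math.inf


def ignore_cluster(n_spectra, n_identified, settings):
    """Return True if a cluster falls outside any of the configured limits."""
    if not settings.min_size <= n_spectra <= settings.max_size:
        return True
    if not (settings.min_identified_spectra
            <= n_identified
            <= settings.max_identified_spectra):
        return True
    return False


defaults = FilterSettings()
strict = FilterSettings(min_size=3, min_identified_spectra=1)
print(ignore_cluster(2, 2, defaults), ignore_cluster(2, 2, strict))
```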
process_cluster(cluster)

Processes the defined cluster.

Parameters:cluster – Cluster to process.

IdTransferer

Class Definition
class spectra_cluster.analyser.id_transferer.IdTransferer(add_to_identified=False, add_to_unidentified=True, include_all_identified=False)

The IdTransferer analyser transfers identifications to spectra that are part of a cluster.

The analysis is run by calling ‘process_cluster’ repeatedly. The transferred identifications are stored in the ‘identification_references’ member variable.

Variables:identification_references – A list of IdentificationReferences
__init__(add_to_identified=False, add_to_unidentified=True, include_all_identified=False)

Creates a default IdTransferer object.

Parameters:
  • add_to_identified – If set identifications are added to identified spectra.
  • add_to_unidentified – If set identifications are added to unidentified spectra.
  • include_all_identified – If set identified spectra that are not part of reliable clusters are returned as well. Additionally, if add_to_identified is set to false and include_all_identified is set to true, the original identifications are returned unchanged.
static extract_main_cluster_psms(cluster)

Extracts the most common identification(s) from a cluster and returns it as a PSM object.

Parameters:cluster – Cluster to extract the most common PSM from.
Returns:Most common PSMs as a list
process_cluster(cluster)

Transfers ids to spectra based on the cluster’s properties

Parameters:cluster – The cluster to process
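
The idea of transferring the most common identification to unidentified spectra can be sketched as below. This is a simplified illustration, not the class's logic: `transfer_ids` is a hypothetical function, spectra are given as (id, sequence or None) pairs, and the `min_ratio` threshold for treating a cluster as reliable is an assumed parameter.

```python
from collections import Counter


def transfer_ids(cluster_spectra, min_ratio=0.7):
    """Assign the dominant sequence of a cluster to its unidentified spectra.

    cluster_spectra: list of (spectrum_id, sequence or None) tuples.
    Returns {spectrum_id: transferred sequence}; empty if the cluster has no
    identifications or its dominant sequence covers less than min_ratio of
    the identified spectra.
    """
    identified = [seq for _, seq in cluster_spectra if seq is not None]
    if not identified:
        return {}
    sequence, count = Counter(identified).most_common(1)[0]
    if count / len(identified) < min_ratio:
        return {}
    return {spec_id: sequence
            for spec_id, seq in cluster_spectra if seq is None}


spectra = [("s1", "PEPTIDE"), ("s2", "PEPTIDE"), ("s3", None), ("s4", "OTHER")]
rejected = transfer_ids(spectra)                 # ratio 2/3 is below 0.7
accepted = transfer_ids(spectra, min_ratio=0.6)
print(rejected, accepted)
```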
IdentificationReference
class spectra_cluster.analyser.id_transferer.IdentificationReference(filename, spec_id, psms, changed_through_clustering, spectrum)

Contains the main identification data.

Variables:
  • filename – The original peak list filename
  • spec_id – The spectrum’s id within the source file
  • psms – A list of PSM objects
  • changed_through_clustering – Logical indicating whether the identification details were changed through the clustering.
  • spectrum – The complete spectrum object
__init__(filename, spec_id, psms, changed_through_clustering, spectrum)

Creates a new instance of the identification reference.

Parameters:
  • filename – Original peak list filename.
  • spec_id – The spectrum’s id within this file.
  • psms – A list of PSM objects
  • changed_through_clustering – Logical indicating whether the identification details were changed through the clustering.

ClusterAsFeatures

Class Definition
class spectra_cluster.analyser.cluster_features.ClusterAsFeatures(result_file, sample_name_extractor=None)

Extracts the number of spectra per sample and cluster and writes the result directly to a file object.

Since the number of samples within the clustering result is not known at the beginning, you have to use the function “add_resultfile_header” to add a header to the result file.

__init__(result_file, sample_name_extractor=None)

Initialises a new ClusterAsFeatures analyser.

Parameters:
  • result_file – A file object that will be used to write the resulting table to. This is necessary since this result data will generally be too large to keep in memory.
  • sample_name_extractor – A function that takes the spectrum’s title as parameter and returns the sample name. If set to None, the default function is used, which returns everything before the first “.”.
add_resultfile_header(file_path)

Adds the header line to the result file that the results were written to during the analysis.

Parameters:file_path – Path to the file where the results are stored.
static extract_basic_sample_name(spec_ref)

Extracts the sample name by returning everything before the first “.” from the title (often used by ProteoWizard converted files) or, if available, by returning the original filename (without path information).

Parameters:spec_ref – The spectrum object.
Returns:The sample name
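
The extraction rule can be sketched as a plain function. The name `basic_sample_name` and its signature (title plus an optional filename) are assumptions; the library function receives a spectrum object instead.

```python
import os


def basic_sample_name(title, filename=None):
    """Return the filename without its path, or the title up to the first '.'."""
    if filename:
        return os.path.basename(filename)
    return title.split(".", 1)[0]


from_title = basic_sample_name("run_01.1234.1234.2")
from_file = basic_sample_name("run_01.1234.1234.2", "/data/run_01.mgf")
print(from_title, from_file)


# A custom extractor with the same call shape (title in, sample name out).
def prefix_before_underscore(title):
    return title.split("_", 1)[0]
```

A custom function such as `prefix_before_underscore` illustrates what could be supplied when titles follow a different naming scheme.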
process_cluster(cluster)

Extracts how many spectra per sample were observed.

Parameters:cluster – The cluster to process

Exporter Package

The exporter package contains a set of “analysers” that convert the cluster’s consensus spectra into a peak list format.

mgf_exporter
class spectra_cluster.analyser.exporter.mgf_exporter.MgfExporter(result_file)

Converts the clusters’ consensus spectra into MGF format.

__init__(result_file)

Initialises a new MgfExporter object.

Parameters:result_file – File object to write to.
process_cluster(cluster)

Convert the cluster’s consensus spectrum into MGF format.

Parameters:cluster – The cluster to process
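
Writing a consensus spectrum as an MGF entry can be sketched as follows. The helper `write_mgf_entry` is hypothetical and only emits the basic MGF fields (TITLE, PEPMASS and the peak list); the fields actually written by MgfExporter may differ.

```python
import io


def write_mgf_entry(handle, title, precursor_mz, mz_values, intensities):
    """Write one spectrum as an MGF block to an open file object."""
    handle.write("BEGIN IONS\n")
    handle.write("TITLE={}\n".format(title))
    handle.write("PEPMASS={}\n".format(precursor_mz))
    # One peak per line: m/z followed by intensity.
    for mz, intensity in zip(mz_values, intensities):
        handle.write("{} {}\n".format(mz, intensity))
    handle.write("END IONS\n\n")


buffer = io.StringIO()
write_mgf_entry(buffer, "cluster-1", 500.25, [100.1, 200.2], [10.0, 20.0])
text = buffer.getvalue()
print(text)
```

Any writable file object can be used in place of the in-memory buffer.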