GOparser 1.1
GOparser is a Python framework for working with Gene Ontology (GO) terms and annotations. GOparser is free and open-source software (see License).
Main Features
- Efficient parsing of GO term information contained in go-basic.obo from the Gene Ontology Consortium.
- Efficient parsing of GO annotations contained in GAF files from EBI’s GO Annotation Database (UniProt-GOA).
- Filtering for annotations of protein-coding genes (using genometools).
- Easy and fast retrieval of all genes annotated with a particular GO term, or all GO terms a particular gene is annotated with.
- Annotations are fully propagated based on is_a and part_of relations between GO terms.
- Cross-species support.
- Support for filtering annotations based on evidence code.
Demo Notebooks
- Demo.ipynb (download): GOparser demo notebook
Missing Features
- Visualizations (e.g., to show relationships between GO terms).
- Support for other GO relations, e.g., regulates and has_part.
GOparser Modules
Overview
The GOParser class provides the API for parsing and accessing GO terms and annotations. Each GO term is represented by a GOTerm object, and each GO annotation is represented by a GOAnnotation object.
Module List
goparser.annotation module
class goparser.annotation.GOAnnotation(gene, term, evidence, db_id=None, db_ref=None, with_=None)
Bases: object
Class representing an annotation of a gene with a GO term.
For a list of annotation properties, see the GAF 2.0 file format specification.
Parameters:
- gene (str) – The gene that is annotated (e.g., “MYOD1”).
- evidence (str) – The three-letter evidence code of the annotation (e.g., “IDA”).
- db_id (str, optional) – Database Object ID of the annotation.
- db_ref (list of str, optional) – DB:Reference of the annotation.
- with_ (list of str, optional) – “With” information of the annotation.
- get_gaf_format() – Return the annotation as a tab-delimited string according to the GAF 2.0 file format.
- get_pretty_format() – Return a nicely formatted string representation of the GO annotation.
get_gaf_format()
Return a GAF 2.0-compatible string representation of the annotation.
Parameters: None
Returns: The formatted string.
Return type: str
goparser.parser module
Module containing the GOParser class.
class goparser.parser.GOParser(quiet=False, verbose=False)
Bases: object
A class for accessing Gene Ontology (GO) term and annotation data.
This class provides functions for parsing text files describing the Gene Ontology and GO annotations, for accessing information about specific GO terms, as well as for querying the data for associations between genes and GO terms.
Parameters:
- quiet (bool, optional) – If True, only warnings and errors will be reported.
- verbose (bool, optional) – If True, enable verbose logging (i.e., including debug messages). If quiet is set to True, the value of this parameter is ignored.
- terms (dict [str:GOTerm]) – A mapping of GO term IDs to GOTerm objects, each representing a single GO term. Populated by the member function parse_ontology.
- genes (set of str) – A set of all “valid” gene names. Populated by the member function parse_annotations. Typically, this is the set of all protein-coding genes of a particular species. GOparser ignores all annotations for genes that are not in this set.
- annotations (list of GOAnnotation objects) – A list of GOAnnotation objects, each representing a single GO annotation. Populated by the member function parse_annotations.
- term_annotations (dict [str:list of GOAnnotation objects]) – A mapping of GO term IDs to lists of GOAnnotation objects, with each list representing all annotations that use a particular GO term.
- gene_annotations (dict [str:list of GOAnnotation objects]) – A mapping of gene symbols to lists of GOAnnotation objects, with each list representing all annotations of a particular gene.
- get_gene_goterms(gene, ancestors=False) – Return all GO terms that the given gene is annotated with. If ancestors is set to True, also return all ancestor GO terms of those terms.
- get_goterm_genes(id_, descendants=True) – Return all genes annotated with the GO term corresponding to the given GO term ID. If descendants is set to True, also return genes annotated with any descendant GO term of this term. Since an annotation with a descendant term also applies to all of that term's ancestors, this is the default behavior.
- save(ofn, compress=False) – Store the GOParser object as a pickle file. If compress is set to True, the object is stored as a gzip’ed pickle file.
Notes
The typical workflow for reading the GO annotations for a specific species looks as follows:
- Step 1) Extract a list of protein-coding genes using the script extract_protein_coding_genes.py from the genometools package. (See the GenomeTools documentation.)
- Step 2) Use the parse_ontology member function to parse the go-basic.obo file, containing the Gene Ontology. (This file can be downloaded from the website of the Gene Ontology Consortium.)
- Step 3) Use the parse_annotations member function to parse a gene association file (GAF), containing annotations of genes with GO terms for a specific species. The list of protein-coding genes generated in Step 1) is used to only parse annotations for protein-coding genes. A species-specific file can be downloaded from the FTP server of the UniProt-GOA database.
Afterwards, the member functions get_term_by_id and get_term_by_name can be used to obtain GOTerm objects containing information about individual GO terms. The member function get_gene_goterms can be used to obtain all GO terms a particular gene is annotated with, and the member function get_goterm_genes can be used to obtain all genes annotated with a particular GO term.
Examples
The following example assumes that the Gene Ontology OBO file and the UniProt-GOA gene association files have been downloaded, and that a list of protein-coding genes named “protein_coding_genes_human.tsv” has been generated using the genometools Python package.
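>>> from goparser import GOParser
>>> G = GOParser()
>>> # Parse the ontology first; parse_annotations relies on the parsed terms.
>>> G.parse_ontology('go-basic.obo')
>>> G.parse_annotations('gene_association.goa_human.gz', 'protein_coding_genes_human.tsv')
>>> # Look up the GO terms that a single gene is annotated with.
>>> print(G.get_gene_goterms('MYC'))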
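The parse_annotations entry in this documentation describes its genes argument as a collection of gene names rather than a file path. If a collection is needed, the gene list from Step 1 can be read into a set first. The following is a minimal sketch only: it assumes a tab-separated file with the gene symbol in the first column, which this documentation does not specify, and read_gene_names is a hypothetical helper that is not part of GOparser.

```python
import csv
import io


def read_gene_names(handle):
    """Collect gene names from the first column of a tab-separated file.

    Assumption (not taken from the GOparser docs): one gene per row, with
    the gene symbol in the first column. Empty rows and rows starting
    with '#' are skipped.
    """
    genes = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        genes.add(row[0])
    return genes


# Toy stand-in for "protein_coding_genes_human.tsv"; the layout is hypothetical.
example = io.StringIO("MYC\tchr8\nMYOD1\tchr11\n\nTP53\tchr17\n")
genes = read_gene_names(example)
print(sorted(genes))
```

With a real file, the handle would come from open(path, newline=""), and the resulting set could then be passed wherever a collection of valid gene names is expected.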
get_gene_goterms(gene, ancestors=False)
Return all GO terms a particular gene is annotated with.
Parameters:
- gene (str) – The gene symbol of the gene.
- ancestors (bool, optional) – If set to True, also return all ancestor GO terms.
Returns: The set of GO terms the gene is annotated with.
Return type: set of GOTerm objects
Notes
If a gene is annotated with a particular GO term, it can also be considered annotated with all ancestors of that GO term.
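The effect of the ancestors flag can be illustrated with a toy ontology. This is a sketch of the general idea, not GOparser's implementation: the term labels, the parents and direct dictionaries, and the helper names are all made up for illustration.

```python
# Toy ontology: each term maps to its direct parents (is_a or part_of).
# The labels are invented; they are not real GO term IDs.
parents = {
    "T:leaf": {"T:mid"},
    "T:mid": {"T:root"},
    "T:other": {"T:root"},
    "T:root": set(),
}

# Direct annotations: gene symbol -> terms it is annotated with.
direct = {"MYC": {"T:leaf"}}


def all_ancestors(term_id):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term_id])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def gene_terms(gene, ancestors=False):
    """Return the direct terms of a gene, optionally with all ancestors."""
    terms = set(direct.get(gene, ()))
    if ancestors:
        for term_id in list(terms):
            terms |= all_ancestors(term_id)
    return terms


print(sorted(gene_terms("MYC")))
print(sorted(gene_terms("MYC", ancestors=True)))
```

The first call yields only the directly annotated term, while the second also includes every term above it in the hierarchy.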
get_gene_sets(min_genes=None, max_genes=None)
Return the set of annotated genes for each GO term.
Parameters:
- min_genes (int, optional) – Exclude GO terms with fewer than this number of genes.
- max_genes (int, optional) – Exclude GO terms with more than this number of genes.
Returns: A gene set “database” with one gene set for each GO term.
Return type: GeneSetDB
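The size filter described by min_genes and max_genes can be sketched as follows. The sketch works on a plain dictionary and returns a plain dictionary; it does not model the GeneSetDB return type, and filter_gene_sets and the toy data are hypothetical names introduced here.

```python
def filter_gene_sets(term_genes, min_genes=None, max_genes=None):
    """Keep only terms whose gene count lies within the given bounds.

    A bound of None disables that side of the filter.
    """
    kept = {}
    for term_id, genes in term_genes.items():
        n = len(genes)
        if min_genes is not None and n < min_genes:
            continue
        if max_genes is not None and n > max_genes:
            continue
        kept[term_id] = genes
    return kept


# Invented term labels and gene names, for illustration only.
term_genes = {
    "T:small": {"A"},
    "T:medium": {"A", "B", "C"},
    "T:large": {"A", "B", "C", "D", "E"},
}
print(sorted(filter_gene_sets(term_genes, min_genes=2, max_genes=4)))
```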
get_goterm_genes(id_, descendants=True)
Return all genes that are annotated with a particular GO term.
Parameters:
- id_ (str) – GO term ID of the GO term.
- descendants (bool, optional) – If set to False, only return genes that are directly annotated with the specified GO term. By default, genes annotated with any descendant term are also returned.
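The difference between the two settings of descendants can be shown on toy data. This is an illustrative sketch under stated assumptions, not GOparser's code: the descendants_of and term_genes dictionaries, the term labels, and the goterm_genes function are invented for the example.

```python
# Precomputed descendants for each term (invented labels, not real GO IDs).
descendants_of = {
    "T:root": {"T:mid", "T:leaf"},
    "T:mid": {"T:leaf"},
    "T:leaf": set(),
}

# Direct annotations: term -> genes annotated with exactly that term.
term_genes = {
    "T:root": {"A"},
    "T:mid": {"B"},
    "T:leaf": {"A", "C"},
}


def goterm_genes(term_id, descendants=True):
    """Return genes annotated with a term, optionally via its descendants."""
    genes = set(term_genes.get(term_id, ()))
    if descendants:
        for child_id in descendants_of[term_id]:
            genes |= term_genes.get(child_id, set())
    return genes


print(sorted(goterm_genes("T:mid")))
print(sorted(goterm_genes("T:mid", descendants=False)))
```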
get_term_by_acc(acc)
Get the GO term corresponding to the given GO term accession number.
Parameters: acc (int) – The GO term accession number.
Returns: The GO term corresponding to the given accession number.
Return type: GOTerm
get_term_by_id(id_)
Get the GO term corresponding to the given GO term ID.
Parameters: id_ (str) – A GO term ID.
Returns: The GO term corresponding to the given ID.
Return type: GOTerm
get_term_by_name(name)
Get the GO term with the given GO term name.
If the given name is not associated with any GO term, the function will search for it among synonyms.
Parameters: name (str) – The name of the GO term.
Returns: The GO term with the given name.
Return type: GOTerm
Raises: ValueError – If the given name is found neither among the GO term names nor among synonyms.
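The lookup order described for get_term_by_name (names first, then synonyms, then ValueError) can be sketched with two dictionaries. The TermIndex class and its contents are hypothetical; how GOparser stores names and synonyms internally is not documented here.

```python
class TermIndex:
    """Toy index that resolves a term by name, falling back to synonyms."""

    def __init__(self):
        self.by_name = {}
        self.by_synonym = {}

    def add(self, term_id, name, synonyms=()):
        self.by_name[name] = term_id
        for synonym in synonyms:
            self.by_synonym[synonym] = term_id

    def get_by_name(self, name):
        # Primary names take precedence over synonyms.
        if name in self.by_name:
            return self.by_name[name]
        if name in self.by_synonym:
            return self.by_synonym[name]
        raise ValueError('No term with name or synonym "%s".' % name)


# Invented term label, name, and synonym for illustration.
index = TermIndex()
index.add("T:0001", "toy process", synonyms=["toy pathway"])

print(index.get_by_name("toy process"))
print(index.get_by_name("toy pathway"))
```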
parse_annotations(annotation_file, genes, db_sel='UniProtKB', select_evidence=None, exclude_evidence=None, exclude_ref=None, strip_species=False, ignore_case=False)
Parse a GO annotation file (in GAF 2.0 format).
GO annotation files can be downloaded from the UniProt-GOA download site or from their FTP server.
Parameters:
- annotation_file (str or unicode) – Path of the annotation file (in GAF 2.0 format).
- genes (list, tuple, or set of str) – Collection of valid gene names.
- db_sel (str, optional) – Select only annotations with this DB (column 1) value. If empty, disable filtering based on the DB value.
- select_evidence (list of str, optional) – Only include annotations with the given evidence codes. If not specified, allow all evidence codes, except for those listed in exclude_evidence.
- exclude_evidence (list of str, optional) – Exclude all annotations with any of the given evidence codes. If select_evidence is specified, this parameter is ignored. If not specified, allow all evidence codes.
- exclude_ref (list of str, optional) – Exclude all annotations with the given DB:Reference (column 6). Example: ["PMID:2676709"]. Note: This filter is currently ignored if an annotation has more than one reference.
- strip_species (bool, optional) – Undocumented.
- ignore_case (bool, optional) – Undocumented.
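The interplay of db_sel, select_evidence and exclude_evidence can be sketched as a per-line filter. In GAF 2.0, comment lines start with "!", column 1 is DB, column 3 is the DB Object Symbol, and column 7 is the evidence code. Everything else below is an assumption made for illustration: keep_annotation is a hypothetical function, matching genes against column 3 is a guess about GOparser's behavior, and the example line uses made-up identifiers.

```python
def keep_annotation(line, genes, db_sel="UniProtKB",
                    select_evidence=None, exclude_evidence=None):
    """Decide whether one GAF 2.0 line passes the filters."""
    if line.startswith("!"):
        return False  # header or comment line
    fields = line.rstrip("\n").split("\t")
    db, symbol, evidence = fields[0], fields[2], fields[6]
    if db_sel and db != db_sel:
        return False  # an empty db_sel disables this filter
    if symbol not in genes:
        return False
    if select_evidence is not None:
        # When select_evidence is given, exclude_evidence is ignored.
        return evidence in select_evidence
    if exclude_evidence is not None and evidence in exclude_evidence:
        return False
    return True


# A made-up 17-column GAF line; identifiers are placeholders.
example_line = "\t".join([
    "UniProtKB", "X00000", "MYC", "", "GO:9000002", "PMID:0000000",
    "IDA", "", "F", "", "", "protein", "taxon:9606", "20150101",
    "UniProt", "", "",
])

print(keep_annotation(example_line, {"MYC"}))
print(keep_annotation(example_line, {"MYC"}, exclude_evidence=["IDA"]))
```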
parse_ontology(fn, flatten=True, part_of_cc_only=False)
Parse an OBO file and store GO term information.
This function needs to be called before parse_annotations, in order to read in the Gene Ontology terms and structure.
Parameters:
- fn (str) – Path of the OBO file.
- flatten (bool, optional) – If set to False, do not generate a list of all ancestors and descendants for each GO term. Warning: Without flattening, GOparser cannot propagate GO annotations properly.
- part_of_cc_only (bool, optional) – Legacy parameter for backwards compatibility. If set to True, ignore part_of relations outside the cellular_component domain.
Notes
The function erases all previously parsed data. The function requires the OBO file to end with a line break.
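To make the parsing and "flattening" steps concrete, here is a minimal sketch that reads [Term] stanzas from OBO-formatted text and then derives ancestors and descendants from is_a and part_of links. It is not GOparser's parser: parse_obo_terms and flatten are hypothetical helpers, terms are stored as plain dictionaries rather than GOTerm objects, and the term IDs in the sample text are made up.

```python
def parse_obo_terms(text):
    """Collect id, name, namespace, is_a and part_of from [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            if line == "[Term]":
                current = {"is_a": set(), "part_of": set()}
            else:
                current = None
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "id":
            current["id"] = value
            terms[value] = current
        elif key == "name":
            current["name"] = value
        elif key == "namespace":
            current["domain"] = value
        elif key == "is_a":
            # Drop the trailing "! comment" part.
            current["is_a"].add(value.split("!")[0].strip())
        elif key == "relationship":
            parts = value.split("!")[0].split()
            if len(parts) == 2 and parts[0] == "part_of":
                current["part_of"].add(parts[1])
    return terms


def flatten(terms):
    """Add transitive 'ancestors' and 'descendants' sets to every term."""
    for term in terms.values():
        seen = set()
        stack = list(term["is_a"] | term["part_of"])
        while stack:
            tid = stack.pop()
            if tid in seen or tid not in terms:
                continue
            seen.add(tid)
            stack.extend(terms[tid]["is_a"] | terms[tid]["part_of"])
        term["ancestors"] = seen
        term["descendants"] = set()
    for tid, term in terms.items():
        for ancestor_id in term["ancestors"]:
            terms[ancestor_id]["descendants"].add(tid)


# Made-up OBO fragment; the IDs do not refer to real GO terms.
SAMPLE = """format-version: 1.2

[Term]
id: GO:9000001
name: toy root
namespace: cellular_component

[Term]
id: GO:9000002
name: toy compartment
namespace: cellular_component
is_a: GO:9000001 ! toy root

[Term]
id: GO:9000003
name: toy sub-compartment
namespace: cellular_component
relationship: part_of GO:9000002 ! toy compartment

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(SAMPLE)
flatten(terms)
print(sorted(terms["GO:9000003"]["ancestors"]))
```

Without the flatten step, each term would only know its direct parents, so an annotation with the sub-compartment could not be propagated to the root.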
static read_pickle(fn)
Load a GOParser object from a pickle file.
The function automatically detects whether the file is compressed with gzip.
Parameters: fn (str) – Path of the pickle file.
Returns: The GOParser object stored in the pickle file.
Return type: GOParser
write_pickle(ofn, compress=False)
Serialize the current GOParser object and store it in a pickle file.
Parameters:
- ofn (str) – Path of the output file.
- compress (bool, optional) – Whether to compress the file using gzip.
Notes
Compression with gzip is significantly slower than storing the file in uncompressed form.
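One common way to write an optionally compressed pickle and to detect compression when reading is to check for the two-byte gzip magic number (0x1f 0x8b) at the start of the file. The sketch below shows that approach with standalone functions; it is an assumption about how such detection can work, not a description of GOparser's actual implementation, and the function names merely mirror the documented ones.

```python
import gzip
import os
import pickle
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def write_pickle(obj, path, compress=False):
    """Pickle obj to path, optionally through gzip."""
    opener = gzip.open if compress else open
    with opener(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def read_pickle(path):
    """Load a pickle, detecting gzip compression from the file header."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)


# Round trip with and without compression, using a plain dict as payload.
payload = {"terms": ["T:root", "T:leaf"], "genes": ["MYC"]}
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.pickle")
    gzip_path = os.path.join(tmp, "compressed.pickle")
    write_pickle(payload, plain_path)
    write_pickle(payload, gzip_path, compress=True)
    restored_plain = read_pickle(plain_path)
    restored_gzip = read_pickle(gzip_path)

print(restored_plain == payload, restored_gzip == payload)
```

Because detection relies on the file content rather than the file extension, the reader does not need to be told how the file was written.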
goparser.term module
class goparser.term.GOTerm(id_, name, domain, definition, is_a, part_of)
Bases: object
Class representing a GO term.
This class is used by GOParser.parse_ontology() to store all parsed GO term data.
Parameters:
- id (str) – The ID of the GO term.
- name (str) – The name of the GO term.
- domain (str) – The domain of the GO term (e.g., “biological_process”).
- definition (str) – The definition (description) of the GO term.
- is_a (set of str) – Set of GO term IDs that this GO term is a “subtype” of.
- part_of (set of str) – Set of GO term IDs that this GO term is a “part” of.
- ancestors (set of str) – Set of GO term IDs that are “ancestors” of this GO term.
- children (set of str) – Set of GO term IDs that are “children” of this GO term.
- parts (set of str) – Set of GO term IDs that are “parts” of this GO term.
- descendants (set of str) – Set of GO term IDs that are “descendants” of this GO term.
- get_pretty_format(omit_acc=False, max_name_length=0, abbreviate=True) – Returns a formatted version of the GO term name and ID.
- acc – Returns the GO term accession number (part of the ID).
static acc2id(acc)
Converts a GO term accession number to an ID.
Parameters: acc (int) – A GO term accession number.
Returns: The ID corresponding to the GO term accession number.
Return type: str
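GO term IDs consist of the prefix "GO:" followed by a seven-digit, zero-padded number, so the conversion between accession number and ID can be sketched as below. The functions are standalone illustrations, and id2acc is a hypothetical inverse added for the example.

```python
def acc2id(acc):
    """Format an integer accession number as a GO term ID."""
    return "GO:%07d" % acc


def id2acc(term_id):
    """Extract the integer accession number from a GO term ID."""
    return int(term_id[3:])


print(acc2id(6260))
print(id2acc("GO:0006260"))
```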
domain_short
get_pretty_format(include_id=True, max_name_length=0, abbreviate=True)
Returns a nicely formatted string with the GO term information.
Parameters:
- include_id (bool, optional) – Include the GO term ID.
- max_name_length (int, optional) – Truncate the formatted string so that its total length does not exceed this value.
- abbreviate (bool, optional) – Use abbreviations (see _abbrev) to shorten the GO term name.
Returns: The formatted string.
Return type: str
License
GOparser Documentation
Copyright (c) 2015 Florian Wagner.
The GOparser Documentation is licensed under a Creative Commons BY-NC-SA 4.0 License.
GOparser
Copyright (c) 2015 Florian Wagner.
The source code of this documentation is part of GOparser.
GOparser is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License, Version 3,
as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.