Welcome to ncfp’s documentation!

https://img.shields.io/travis/widdowquinn/ncfp.svg?style=flat-square https://img.shields.io/codecov/c/github/widdowquinn/ncfp.svg?style=flat-square https://api.codacy.com/project/badge/Grade/99a037e5eb2c4ae48e27e44c8974a3f8 https://readthedocs.org/projects/ncfp/badge/?version=latest

https://img.shields.io/pypi/v/ncfp.svg?style=flat-square https://img.shields.io/pypi/l/ncfp.svg?style=flat-square

https://anaconda.org/bioconda/ncfp/badges/version.svg https://anaconda.org/bioconda/ncfp/badges/latest_release_date.svg

https://anaconda.org/bioconda/ncfp/badges/installer/conda.svg https://anaconda.org/bioconda/ncfp/badges/downloads.svg https://anaconda.org/bioconda/ncfp/badges/platforms.svg

If you’re feeling impatient, please head over to the QuickStart Guide

Description

ncfp is a program and Python package that, given a set of protein sequences with appropriate identifier values, retrieves corresponding coding nucleotide sequences from the NCBI nucleotide databases. This is useful, for instance, to help backthread coding sequences onto protein alignments for phylogenetic analyses.

Reporting problems and requesting improvements

If you encounter bugs or errors, or would like to suggest ways in which ncfp can be improved, please raise a new issue at the ncfp GitHub issues page.

If you’d like to fix a bug or make an improvement yourself, contributions are welcomed, and guidelines on how to do this can be found at the Contributing documentation page.

Getting started

To get started, please read the documentation, below:

About ncfp

ncfp is a program and accompanying Python package (ncbi_cds_from_protein) that, given a set of protein sequences with appropriate identifier values, retrieves corresponding coding nucleotide sequences from the NCBI nucleotide databases. This is useful, for instance, to help backthread coding sequences onto protein alignments for phylogenetic analyses.

How ncfp works

1. The ncfp program accepts input FASTA sequence files describing protein sequences. These describe a set of queries that will be used to obtain corresponding nucleotide coding sequences, if possible.

Input sequence formats
Please see Input sequence formats.

2. An SQLite cache database is created. This will hold information about the query sequences, and about the data downloaded from NCBI using the input sequence data as queries. The cache enables recovery from interrupted jobs, and reuse of data without needing to transfer across the network again by interrogating the SQLite database directly.

Cache directory
By default, the cache is created in the hidden subdirectory .ncfp_cache, but any location can be specified using the -d or --cachedir arguments.
Cache filename
By default, the cache has a filestem reflecting the date and time that ncfp is run, but this can be changed using the -c or --cachestem arguments.

3. The Biopython Entrez library is used to make a connection to the NCBI sequence databases. Using this connection, the program identifies nucleotide database coding sequence entries that are related to each input protein sequence. The relationship is determined on the basis of either the NCBI accession, or the UniProt gene identifier, depending on the input file.

Batched downloads
By default, ncfp makes queries and downloads sequences in batches of 100. The batch size can be controlled using the arguments -b or --batchsize
Retry attempts
Sometimes network connections are flaky, so by default ncfp will try each request up to 10 times. The number of retries can be controlled with the -r or --retries arguments.

4. If the results of each NCBI query are not already present in the cache, they are downloaded and recorded in the cache as header information. Some specific data are extracted, (sequence length, taxonomy, etc.)

5. The shortest available complete coding sequence [1] that recapitulates each input protein sequence is identified. If the sequence is not already present in the cache, it is downloaded.

6. Pairs of protein and corresponding coding sequences are written to two files: one for nucleotide sequences, and one for protein sequences. Sequences are written to each file in the same order, so they can be used for backtranslation with a tool such as T-Coffee. If any proteins could not be matched to their coding sequence at NCBI, they are written to a third file.

Output filenames
The filestem for the paired protein and coding sequence files is always suffixed by _aa or _nt, depending on the type of sequence being written. The filestem is ncfp by default, but this can be controlled with the --filestem argument.
Skipped sequences filename
By default, the protein sequences for which a coding sequence could not be found are written to the skipped.fas file. An alternative path can be provided with the --skippedfile argument.

Footnotes

[1]We require the complete coding sequence, but if we can use a shorter sequence than the complete genome, we do to save bandwidth.

QuickStart Guide

Installation

Using conda

ncfp is available through the bioconda channel of Anaconda:

conda install -c bioconda ncfp
From source

At the command-line, use git to clone the current version of the ncfp repository:

git clone git@github.com:widdowquinn/ncfp.git

Change to the newly-created ncfp subdirectory:

cd ncfp

Install the package and program, using the setup.py file:

python setup.py install

(other installation methods can be found on the Installation page)

ncfp Example

To see options available for the ncfp program, use the -h (help) option:

ncfp -h

In the ncfp/tests/test_input/sequences subdirectory there is a file called input_ncbi.fasta. This contains a number of protein sequences in FASTA format. The identifier for each sequence in this file is a valid NCBI sequence identifier.

Using ncfp, to obtain a corresponding nucleotide coding sequence for each protein, issue the following command (substituting your own email address, where indicated):

ncfp tests/test_input/sequences/input_ncbi.fasta \
     example_output \
     my.name@my.domain

You should see progress bars appear for processing of the input protein, sequences, searching those sequences against the remote NCBI databases, then retrieving the corresponding sequence identifiers, GenBank headers and finally the full GenBank records.

On completion, a list of the recovered sequences will be presented, and the directory example_output will be created, with the following contents:

$ tree example_output/
example_output/
├── ncfp_aa.fasta
└── ncfp_nt.fasta

The two files should contain corresponding amino acid and nucleotide sequences:

$ head example_output/*.fasta
==> example_output/ncfp_aa.fasta <==
>XP_004520832.1 kunitz-type serine protease inhibitor homolog dendrotoxin I-like [Ceratitis capitata]
MRTKFVLVFALIVCVLNGLGEAQRPAHCLQPHPQGVGRCDMLISGFFYNSERNECEQWTE
EGCRVQGGHTYDFKEDCVNECIEIN
>XP_017966559.1 PREDICTED: kunitz-type serine protease inhibitor homolog dendrotoxin I-like [Drosophila navojoa]
MKFILLLACLCVYVATLEAQRPPCKGIVPPWLTNCVGGKNEGRGNLRSCARNANSRMWWY
DSRSRSCKKMAYKGCGGNRNRYCTREACRRACRRRN
>XP_017841791.1 PREDICTED: kunitz-type serine protease inhibitor homolog dendrotoxin K-like [Drosophila busckii]
MKVCLILSALVLQYIVFVNAEGCPLRPAEQNCQSSRNVGVSSYSNCILTKRLMWYYNPTI
RDCLPLDFRGCGGNGNRYCSLKDCQQSCKHT
>XP_017046608.1 PREDICTED: kunitz-type serine protease inhibitor homolog dendrotoxin I [Drosophila ficusphila]

==> example_output/ncfp_nt.fasta <==
>XP_004520832.1 coding sequence
ATGAGAACTAAATTTGTTTTGGTATTCGCGCTCATTGTTTGTGTACTCAACGGTTTAGGT
GAAGCGCAAAGACCAGCACATTGCTTACAACCACATCCACAAGGAGTTGGCCGTTGTGAT
ATGCTTATCAGTGGTTTCTTCTATAACTCGGAGCGTAATGAGTGCGAGCAATGGACAGAG
GAGGGCTGCCGTGTGCAGGGTGGGCACACATACGATTTCAAAGAAGATTGTGTAAATGAG
TGCATTGAAATTAATTAA
>XP_017966559.1 coding sequence
ATGAAATTCATTCTGCTCCTCGCTTGTCTCTGCGTCTACGTGGCCACCCTTGAGGCTCAG
CGACCCCCTTGCAAGGGAATAGTGCCTCCATGGTTGACCAATTGTGTTGGAGGCAAGAAC
GAGGGCAGGGGTAACCTTCGCTCGTGCGCCAGGAACGCGAATTCCAGAATGTGGTGGTAT

Requirements

To use ncfp you will need:

Operating System
  • macOS or Linux
Python
  • Python version 3.5 or higher
Python Packages

Installation

ncfp can be installed in any of several ways.

Using pip

The most recent release of ncfp is available at the PyPI warehouse, and can be installed using pip:

pip install ncfp

Using bioconda

ncfp is available through the bioconda channel of the conda package management system. To install ncfp, you will need Anaconda or miniconda, and to set up the bioconda channel. Then you can use the conda install ncfp command to install the package:

conda install -c bioconda ncfp

Alternatively:

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install ncfp

From source (most recent release)

ncfp releases are available at the GitHub releases page. To install from source:

  1. Download the .tar.gz or .zip file containing the package (click the link, use curl or wget).
  2. Uncompress the archive.
  3. Change to the newly-expanded directory.
  4. Install Python module requirements with pip.
  5. Install ncfp with setup.py
$ wget https://github.com/widdowquinn/ncfp/archive/v0.1.0.tar.gz
$ tar -zxvf v0.1.0.tar.gz
$ cd ncfp
$ pip install -r requirements.txt
$ python setup.py install

From source (bleeding edge)

To get the very latest development version of ncfp, you can clone the repository from the GitHub project page

  1. Clone the repository from git@github.com:widdowquinn/ncfp.git
  2. Change to the repository root directory.
  3. Install Python module requirements with pip.
  4. Install ncfp with setup.py
$ git clone git@github.com:widdowquinn/ncfp.git
$ cd ncfp
$ pip install -r requirements.txt
$ python setup.py install

Basic Use

Given a set of protein sequences in the file <INPUT>.fasta as input, the path to an output directory as <OUTPUT>, and the user’s email address [1] as <EMAIL>, the following command will query the NCBI databases for nucleotide coding sequences corresponding to the input, and write the results to files in <OUTPUT>:

ncfp <INPUT>.fasta <OUTPUT> <EMAIL>

The output directory <OUTPUT> will contain at least two files: <OUTPUT>/ncfp_aa.fasta and <OUTPUT>/ncfp_nt.fasta. The <OUTPUT>/ncfp_aa.fasta file will contain sequences for which a corresponding coding sequence could be found, and <OUTPUT>/ncfp_nt.aa will contain those coding sequences.

The sequences in <OUTPUT>/ncfp_nt.aa are trimmed and validated by ncfp to produce a conceptual translation identical to the corresponding input protein sequence.

Any protein sequences for which a partner nucleotide coding sequence could not be found will be written to the file <OUTPUT>/skipped.fas

Input sequence formats

Input protein sequences must be provided in FASTA format, and ncfp expects input sequence headers to take one of two forms: “NCBI” or “UniProt”. By default, ncfp expects sequences to be in NCBI format:

ncfp <INPUT>.fasta <OUTPUT> <EMAIL>

For sequence input in UniProt format, one of the -u or --uniprot options must be used, e.g.

$ ncfp -u <INPUT>.fasta <OUTPUT> <EMAIL>
$ ncfp --uniprot <INPUT>.fasta <OUTPUT> <EMAIL>
NCBI header format

In NCBI header format, the sequence identifier is expected to correspond to a valid NCBI protein sequence accession, e.g.

>XP_004520832.1 kunitz-type serine protease inhibitor homolog dendrotoxin I-like [Ceratitis capitata]
MRTKFVLVFALIVCVLNGLGEAQRPAHCLQPHPQGVGRCDMLISGFFYNSERNECEQWTEEGCRVQGGHT
YDFKEDCVNECIEIN

If a coding sequence is identified successfully, the output nucleotide sequence header will have the same accession as a sequence identifier, e.g.

>XP_004520832.1 coding sequence
ATGAGAACTAAATTTGTTTTGGTATTCGCGCTCATTGTTTGTGTACTCAACGGTTTAGGT
GAAGCGCAAAGACCAGCACATTGCTTACAACCACATCCACAAGGAGTTGGCCGTTGTGAT
ATGCTTATCAGTGGTTTCTTCTATAACTCGGAGCGTAATGAGTGCGAGCAATGGACAGAG
GAGGGCTGCCGTGTGCAGGGTGGGCACACATACGATTTCAAAGAAGATTGTGTAAATGAG
TGCATTGAAATTAATTAA
UniProt header format

In UniProt header format, the sequence description string is expected to correspond to a UniProt download and contain the GN gene identifier key:value pair, e.g.

>tr|A0A1V9Y7A7|A0A1V9Y7A7_9STRA Lon protease homolog OS=Thraustotheca clavata GN=THRCLA_11583 PE=3 SV=1
MYRASSKVTSAHNDGIWSTVWTSRNQIISGSLDEVVKSWDASSSEDNAILPVVKQFPGHV
LGTLAVTATKDGRKAATSSLDCQVRILNLESGGIEKTIDTGAGESWQLVYSPDDTFIATG
SQQSKINLINLEQEKIVNSIPVDGKFILAVAYSPDGKHLACGTFEGIVAIYDVETGKQVQ
KYQDRAKPVRSISYSPDGSFLLAASDDMHVNIYDVLHSSLVGSVSGHISWILSVACSPDG

If a coding sequence is identified successfully, the output nucleotide sequence header will have the gene accession as its sequence identifier, e.g.

>THRCLA_11583 coding sequence
ATGTACCGCGCCTCGTCCAAAGTAACGTCGGCTCATAATGATGGAATCTGGAGTACTGTC
TGGACAAGCCGCAATCAAATCATAAGTGGATCTTTGGATGAAGTGGTCAAGAGCTGGGAT
GCGAGTAGTTCCGAGGACAATGCGATTTTGCCTGTTGTCAAGCAATTTCCAGGCCACGTT
CTAGGCACACTGGCAGTGACTGCAACGAAAGATGGTCGAAAAGCTGCTACATCGTCTTTA
Stockholm domain format

UniProt and other sources use Stockholm format to indicate that an amino acid sequence represents a portion of a protein (such as a domain). ncfp can recognise this format and trim the coding sequence to correspond only to the specified region of the protein.

Stockholm format domains are indicated by the syntax /<start>-<stop> immediately following the sequence identifier in FASTA format, e.g.

>tr|B7G6L2|B7G6L2_PHATC/43-112 [subseq from] Predicted protein OS=Phaeodactylum tricornutum (strain CCAP 1055/1) GN=PHATRDRAFT_48282 PE=4 SV=1
-----------------------------SLCV-EVAGA-SQD---DGASIFQGDCN-dG
NKHQVFDFipaPG---TdsgFHRIRA--SHSN-KCLGVADGAL--APG-AEVVQ-

To restrict the coding sequence to the region indicated in Stockholm format, pass either the -s or --stockholm option, e.g.

$ ncfp -u -s <INPUT>.fasta <OUTPUT> <EMAIL>
$ ncfp --uniprot --stockholm <INPUT>.fasta <OUTPUT> <EMAIL>

The output nucleotide sequence does not preserve the Stockholm format location information in the output, nor does it preserve sequence gap symbols:

>PHATRDRAFT_48282 coding sequence
TCGCTCTGCGTGGAGGTGGCTGGAGCGAGCCAAGACGACGGGGCCTCCATATTTCAAGGG
GATTGTAATGACGGAAACAAGCATCAAGTCTTCGACTTCATTCCTGCTCCCGGTACAGAC
AGCGGTTTTCATCGAATTCGAGCCTCGCACTCCAACAAGTGCCTTGGCGTGGCTGATGGG
GCTTTAGCACCTGGAGCTGAGGTAGTGCAA
[1]The user’s email address is passed to NCBI to enable them to monitor use of their service and provide support

Examples

NCBI format input sequences - no introns

The command below [1] identifies coding sequences from NCBI format [2] input for nine proteins that do not contain introns.

ncfp tests/test_input/sequences/input_ncbi.fasta \
    tests/test_output/ncbi dev@null.com -v

UniProt format input sequences - no introns

The command below [1] identifies coding sequences from UniProt format [3] input for ten proteins that do not contain introns. The -u or --uniprot argument is required to specify that the input sequences are UniProt format, otherwise an error is thrown.

ncfp -u tests/test_input/sequences/input_uniprot.fasta \
    tests/test_output/uniprot dev@null.com -v

UniProt/Stockholm input sequences - no introns

The command below [1] identifies coding sequences from UniProt format [3] input for 57 amino acid sequences specifying regions of a protein in Stockholm notation [4]. The -u or --uniprot argument is required to specify that the input sequences are UniProt format, and the -s or --stockholm arguments are required to tell ncfp to parse the region locations.

ncfp -us tests/test_input/sequences/input_uniprot_stockholm.fasta \
    tests/test_output/uniprot_stockholm dev@null.com -v

Human sequences - isoforms and intron/exon structure

The command below [1] identifies coding sequences from NCBI format [3] input for four human proteins with intron/exon structure, including three isoforms of the same protein from the same locus (GPR137: NP_001164351.1, NP_001164352.1, and XP_005274161.1).

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/human dev@null.com -v

Logging

Verbose output can be written persistently to a logfile using the -l or --logfile argument and specifying the path to which the logfile should be written. An example is given in the command below.

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/logging dev@null.com \
    -l tests/test_output/logging/human.log

Specifying the cache location

By default a new cache database is created every time that ncfp is run, in the .ncfp_cache hidden subdirectory. The default cache database filename is ncfpcache_YYYY-MM-DD-HH-MM-SS.sqlite3, indicating the time that the command was run. This location and naming convention can be overridden with the -d/--cachedir and -c/--cachestem arguments, as in the command below.

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/caches dev@null.com \
    -d tests/test_output/caches \
    -c ncfp_cache

Reusing an existing cache

To avoid unnecessary bandwidth/NCBI queries, an existing cache database can be used. The location of the cache is specified with the -d/--cachedir and -c/--cachestem arguments, and the --keepcache option must be specified. If the specified location does not contain a cache database, one is created. For example:

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/caches dev@null.com \
    -d tests/test_output/caches \
    -c ncfp_cache

will create a cache at tests/test_output/caches/ncfp_cache.sqlite3, and

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/caches dev@null.com \
    -d tests/test_output/caches \
    -c ncfp_cache \
    --filestem cached \
    --keepcache

will reuse the cachefile without making new queries at NCBI, and write the output to cached_aa.fasta and cached_nt.fasta [5].

Footnotes

[1](1, 2, 3, 4) The -v option shows verbose output in STDOUT.
[2]The sequence identifier in the FASTA header is a valid NCBI protein accession.
[3](1, 2, 3) The sequence description in the FASTA header contains a valid GN=<accession> gene identifier.
[4]The sequence identifier in the FASTA header ends with a Stockholm format region definition, e.g. /47-134.
[5]The --filestem argument changes the filestem of the output nucleotide and amino acid sequence files.

Testing

ncfp tests are implemented in the Nose framework [1].

To run all tests locally, please issue the command:

nosetests -v

Obtaining test coverage information

The Nose framework integrates with the coverage.py module, and an account of the extent of test coverage can be obtained by running the following command:

nosetests -v --with-coverage \
    --cover-package=ncbi_cds_from_protein \
    --cover-html

Footnotes

[1]We are aware that nosetests is in maintenance mode, and a move to py.test is planned.

Contributing

Reporting bugs and errors

If you find a bug, or an error in the code or documentation, please report this by raising an issue at the GitHub issues page for ncfp

Contributing code or documentation

We gratefully accept code contributions, if you would like to fix a bug, improve documentation, or extend ncfp. To make everyone’s lives easier in this process, we ask that you please follow the procedure below:

  1. Fork the ncfp repository under your account at GitHub.
  2. Clone your fork to your development machine.
  3. Create a new branch in your forked repository with an informative name, e.g. git checkout -b fix_issue_107.
  4. Make the changes.
  5. Run the repository tests (see the Testing documentation for more details).
  6. If the tests pass, push the changes to your fork, and submit a pull request against the original repository.
  7. Indicate one of the ncfp developers as an assignee in your pull request.

Suggestions for improvement

If you would like to make a suggestion for how we could improve ncfp, we welcome contributions at the GitHub issues page.

Licensing

Unless otherwise indicated, all code is subject to the following agreement:

(c) The James Hutton Institute 2017-2018 Author: Leighton Pritchard

Contact: leighton.pritchard@hutton.ac.uk

Address:
Leighton Pritchard, Information and Computational Sciences, James Hutton Institute, Errol Road, Invergowrie, Dundee, DD6 9LH, Scotland, UK

The MIT License

Copyright (c) 2017-2018 The James Hutton Institute

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.