Tabel of contents

Content

TinyFasta

PyPI package Travis CI build status (Linux) AppVeyor CI build status (Windows) Code Coverage Documentation Status

Python package for working with biological sequences from FASTA files.

Features

  • Easy to use: intuitive API for parsing, searching and writing FASTA files
  • Lightweight: no dependencies outside Python’s standard library
  • Cross-platform: Linux, Mac and Windows are all supported
  • Works with with Python 2.7, 3.2, 3.3, and 3.4

Quick Guide

To install the TinyFasta package:

sudo pip install tinyfasta

To parse a FASTA file:

>>> from tinyfasta import FastaParser
>>> for fasta_record in FastaParser("tests/data/dummy.fasta"):
...     if fasta_record.description.contains('seq1'):
...         print(fasta_record)
...
>seq1|contains 2x78 A's
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

To create a FASTA record:

>>> from tinyfasta import FastaRecord
>>> sequence = "C" * 100
>>> fasta_record = FastaRecord.create("My Sequence", sequence)
>>> print(fasta_record)
>My Sequence
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCC

Installation

The tinyfasta package can be installed using pip.

sudo pip install tinyfasta

Alternatively, you can clone the package from GitHub and install it.

git clone git@github.com:tjelvar-olsson/tinyfasta.git
cd tinyfasta
sudo python setup.py install

Parsing FASTA files

To parse a FASTA file we make use of the tinyfasta.FastaParser class.

>>> from tinyfasta import FastaParser

To create a tinyfasta.FastaParser instance we simply need the path to the FASTA file of interest.

>>> fasta_parser = FastaParser('tests/data/dummy.fasta')
>>> fasta_parser.fpath
'tests/data/dummy.fasta'

We can then iterate over all the tinyfasta.FastaRecord instances in the FASTA file.

>>> for fasta_record in fasta_parser:  
...     print(fasta_record)
...
>seq1|contains 2x78 A's
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq2|starts with ATTA motif in first line
ATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
...

Finding FASTA records

To find specific FASTA records one can simply iterate over the individual records in a particular FASTA file and check if the description and/or sequence contains a particular string or regular expression. Let us therefore start by creating a tinyfasta.FastaParser instance.

>>> from tinyfasta import FastaParser
>>> fasta_parser = FastaParser('tests/data/dummy.fasta')

Matching based on the description line

Now let us look for a FASTA record where the description contains the string seq1.

>>> for fasta_record in fasta_parser:
...     if fasta_record.description.contains('seq1'):
...         print(fasta_record)
...
>seq1|contains 2x78 A's
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Suppose we wanted to find all the FASTA records where the description line started with >seq1|, >seq2| or >seq3|. This query can be expressed using the regular expression below.

>>> import re
>>> search_term = re.compile(r'^>seq[1-3]\|')

We can use compiled regular expression to identify FASTA records of interest.

>>> for fasta_record in fasta_parser:
...     if fasta_record.description.contains(search_term):
...         print(fasta_record)
...
>seq1|contains 2x78 A's
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq2|starts with ATTA motif in first line
ATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq3|ends with ATTA motif in second line
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTA

Matching based on the sequence

We can use a similar approach to check if a tinyfasta.FastaRecord contains a sequence motif.

Let us first look for records containing a simple ATTA motif.

>>> for fasta_record in fasta_parser:
...     if fasta_record.sequence.contains('ATTA'):
...         print(fasta_record)
...
>seq2|starts with ATTA motif in first line
ATTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq3|ends with ATTA motif in second line
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTA
>seq4|contains ATTA motif in middle of first line
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq5|contains ATTA motif split over two lines
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT
TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

More complicated sequence motifs can be searched for by compiling regular expressions. Suppose we wanted to be able to identify any of the sequences below:

ACCCA
ACCTA
ACTTA
ATTTA
ATTCA
ATCCA

This could be achieved with the regular expression A[C,T]{3}A.

>>> motif = re.compile(r"A[C,T]{3}A")

Now let us find all the FASTA records that contain this motif.

>>> for fasta_record in fasta_parser:
...     if fasta_record.sequence.contains(motif):
...         print(fasta_record)
...
>seq7|contains ACCCA motif
AAAAAAAAAAAAAAAAAAAAAAAAAAACCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq8|contains ATTTA motif
AAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Matching based on the sequence length

The __len__() magic method of both the tinyfasta.Sequence and tinyfasta.FastaRecord classes return the length of the biological sequence. One can therefore use Python’s built-in len() function when looking for sequences of a particular length.

For example suppose we wanted to find all the sequences with fewer than 80 bases.

>>> for fasta_record in fasta_parser:
...     if len(fasta_record) < 80:
...         print(fasta_record)
...
>seq7|contains ACCCA motif
AAAAAAAAAAAAAAAAAAAAAAAAAAACCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq8|contains ATTTA motif
AAAAAAAAAAAAAAAAAAAAAAAAAAATTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Creating FASTA records

There are two ways of creating tinyfasta.FastaRecord instances. We can create them from a description and a long sequence string or we can build them up from a description and several sequence strings. The latter approach is used internally by the tinyfasta.FastaParser.

Using a long sequence string

Let us import the tinyfasta.FastaRecord class and create a description and sequence strings.

>>> from tinyfasta import FastaRecord
>>> description = 'My Sequence'
>>> sequence = 'C' * 500

We can now create a tinyfasta.FastaRecord from the description and sequence strings by using the tinyfasta.FastaRecord.create() static method.

>>> from tinyfasta import FastaRecord
>>> fasta_record = FastaRecord.create(description, sequence)

Let us print out the record to verify what we got.

>>> print(fasta_record)
>My Sequence
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
CCCCCCCCCCCCCCCCCCCC

Using several sequence strings

However, suppose that we wanted to create a tinyfasta.FastaRecord from a file containing the input sequence split over several lines. In this scenario we can simply add the sequence lines one by one.

Let us create a tinyfasta.FastaRecord to add the sequence lines to.

>>> fasta_record = FastaRecord('Yet Another Record')

Now we can start adding sequence lines to it.

>>> fasta_record.add_sequence_line("AAAAAAAA")
>>> fasta_record.add_sequence_line("TTTTTTTTTTTT")
>>> fasta_record.add_sequence_line("CCCCCC")
>>> fasta_record.add_sequence_line("GGGGGGGGGGGGGGG")

Note that by default the string representation of the tinyfasta.FastaRecord will contain the original sequence line splits.

>>> print(fasta_record)
>Yet Another Record
AAAAAAAA
TTTTTTTTTTTT
CCCCCC
GGGGGGGGGGGGGGG

However, using the tinyfasta.FastaRecord.format_sequence_line_length() function we can standardised line length.

>>> fasta_record.sequence.format_line_length(30)
>>> print(fasta_record)
>Yet Another Record
AAAAAAAATTTTTTTTTTTTCCCCCCGGGG
GGGGGGGGGGG

API

Package for parsing and generating FASTA files of biological sequences.

Use the tinyfasta.FastaParser class to parse FASTA files.

To generate FASTA files use the tinyfasta.FastaRecord.create() static method to create tinyfasta.FastaRecord instances, which can be written to file.

class tinyfasta.Sequence

Class representing a biological sequence.

add_sequence_line(sequence_line)

Add a sequence line to the tinyfasta.Sequence instance.

This function can be called more than once. Each time the function is called the tinyfasta.Sequence is extended by the sequence line provided.

Parameters:sequence_line – string representing (part of) a sequence
format_line_length(line_length=80)

Format line length used to represent the sequence.

The full sequence is stored as list of shorter sequences. These shorter sequences are used verbatim when writing out the tinyfasta.FastaRecord over several lines.

Parameters:line_length – length of the sequences used to make up the full sequence
class tinyfasta.FastaRecord(description)

Class representing a FASTA record.

class Description(description)

Description line in a tinyfasta.FastaRecord.

update(description)

Update the content of the description.

This function can be used to replace the existing description with a new one.

Parameters:description – new description string
FastaRecord.add_sequence_line(sequence_line)

Add a sequence line to the tinyfasta.FastaRecord instance.

This function can be called more than once. Each time the function is called the tinyfasta.sequence is extended by the sequence line provided.

Parameters:sequence_line – string representing (part of) a sequence
static FastaRecord.create(description, sequence)

Return a FastaRecord.

Parameters:
  • description – description string
  • sequence – full sequence string
Returns:

tinyfasta.FastaRecord

class tinyfasta.FastaParser(fpath)

Class for parsing FASTA files.