AbTools: Utilities for antibody sequence analysis¶
AbTools provides core data structures and methods for analysis of antibody repertoire data. Tools for manipulating sequences, pairwise and multiple sequence alignment and sequence clustering are included.
Additionally, AbTools provides utilities for secondary analysis of antibody sequence data. Error correction, mining NGS datasets for sequences with similarity to known antibody sequences, generating lineage phylogenies, and making repertoire-level comparisons are all covered.
Getting Started¶
Install¶
The easiest way to install AbTools locally (on OSX or Linux) is to use pip:
$ pip install abtools
If you don’t have pip, the Anaconda Python distribution contains pip along with a ton of useful scientific Python packages and is a great way to get started with Python.
AbTools does not run natively on Windows, but Windows users can run AbTools with Docker (AbTools is included in the AbStar Docker image):
$ docker pull briney/abstar
$ docker run -it briney/abstar
Stable and development versions of AbTools can also be downloaded from Github. You can manually install the latest development version of AbTools with:
$ git clone https://github.com/briney/abtools
$ cd abtools/
$ python setup.py install
Note
If installing manually via setup.py and you don’t already have scikit-bio installed, you may get an error when setuptools attempts to install scikit-bio. This can be fixed by first installing scikit-bio with pip:
$ pip install scikit-bio
and then retrying the manual install of AbTools.
Requirements¶
- Python 2.7.x (Python 3 compatability is in the works)
- biopython
- celery
- ete2
- matplotlib
- pandas
- pymongo
- scikit bio
- seaborn
Additional dependencies¶
Several AbTools components have additional non-Python dependencies:
abtools.alignment
requires MAFFT and MUSCLEabtools.correct
requires CDHIT and USEARCHabtools.mongodb
requires MongoDBabtools.phylogeny
requires MUSCLE and FastTreeabtools.s3
requires s3cmd
If using Docker, all of the the non-Python dependencies are included.
Commandline Use¶
AbCompare¶
Overview¶
AbCompare is used to perform repertoire-level comparison of antibody sequence data using a variety of similarity and divergence measures. Currently, AbCompare compares samples using the frequency of V-gene use, although other comparison types (such as clonality) are planned. However, the underlying similarity and divergence functions are accessible via the AbTools API, so you can compare samples using other characteristics.
Similarity (or divergence) scores are computed by subsampling each dataset and computing the score for the subsamples. This process is repeated many times, and the median score for all of the iterations is returned. In addition to producing a more accurate representation of the true score, it also makes is possible to directly compare datasets of different sizes.
Examples¶
To compute the Marisita-Horn similarity of two collections, both in the same MongoDB database:
$ abcompare -d MyDatabase -1 Collection1 -2 Collection2 -o /path/to/output/
If only one collection is provided (via -1
), then that collection will
be iteratively compared to all other collections in the database:
$ abcompare -d MyDatabase -1 Collection1 -o /path/to/output/
If you leave out collections entirely, all collections in the database will be iteratively compared to all other collections:
$ abcompare -d MyDatabase -o /path/to/output/
Finally, if you’d like to compare only those collections that share a common
prefix (for example, if your collection names are formatted as SubjectName_Timepoint
and you’d like to compare all the timepoints from a single subject):
$ abcompare -d MyDatabase --collection-prefix SubjectName -o /path/to/output/
The default comparison method is Marisita-Horn similiarity, but several other methods are provided:
- Marisita-Horn similarity (
'marisita-horn'
)- Kullback-Leibler divergence (
'kullback-leibler'
)- Jensen-Shannon similarity (
'jensen-shannon'
)- Jaccard similarity (
'jaccard'
)- Bray-Curtis similarity (
'bray-curtis'
)- Renkonen similarity (
'renkonen'
)- Cosine similarity (
'cosine'
)
To use an alternate comparison method, pass the method with the --method
option:
$ abcompare -d MyDatabase -o /path/to/output/ --method jaccard
The number of sequences used in each iteration (--chunksize
, default is 100,000) and the
number of iterations (--iterations
, default is 10,000) can also be changed:
$ abcompare -d MyDatabase -o /path/to/output --iterations 1000 --chunksize 25000
As with other AbTools applications, there are options for connecting to remote MongoDB
servers (--ip
and --port
) and MongoDB authentication (--user
and --password
).
A complete list of AbCompare options can be obtained by:
$ abcompare --help
AbCorrect¶
Overview¶
AbCorrect is a full-featured utility for performing error-correction on antibody repertoire sequencing data. Error correction can be performed using Unique Antibody IDs (UAIDs; also known as molecular barcodes) or by identity clustering.
The AbCorrect command-line application is designed to work with antibody sequence data that has already been annotated with AbStar. Although other error correction tools for antibody repertoire sequencing operate on raw data (or, in the case of paired reads, on raw data after read merging), we have found that annotating the sequences with AbStar before error correction allows us to focus clustering and consensus/centroid generation on just the VDJ region of the antibody sequence in the proper orientation. In our hands, this tends to produce more accurate, reproducible results.
In addition to being provided as a command-line application, the core functionality can be accessed through the AbTools API, which allows AbCorrect to be integrated into more sophisticated sequence processing pipelines. Below are several examples showing how to use AbCorrect as a command-line application.
Examples¶
The simplest use case for AbCorrect is to perform error correction on a single JSON file, which is the output from running AbStar on a single FASTA/Q file:
$ abcorrect -j /path/to/MyData.json -t /path/to/temp/ -o /path/to/output/
This will cluster sequences based on UAIDs (which have been pre-parsed by AbStar)
and generate a ‘consensus’ sequence for each UAID cluster containing at least one
sequence (‘consensus’ is in quotes because it isn’t truly a consensus for UAID bins
with a single sequence). Output will be a single FASTA file of error-corrected sequences,
located at /path/to/output/MyData.fasta
.
To perform the same operation, but only calculate consensus sequences for UAID clusters with at least 3 sequences:
$ abcorrect -j /path/to/MyData.json -t /path/to/temp/ -o /path/to/output/ --min 3
If you want to correct errors using UAIDs but you forgot to have AbStar parse them, you can have AbCorrect parse them by passing the length of the barcode (in nucleotides):
$ abcorrect -j /path/to/MyData.json -t /path/to/temp/ -o /path/to/output/ --parse-uaids 20 --min 3
This will use the first 20nt of the raw merged read as the UAID. If the UAID is at the end
of the read (for paired reads, this would be the start of R2), use a negative number for
--parse-uaids
.
To cover the relatively rare case (assuming the UAID length was selected appropriately) where two
sequences were tagged with the same barcode, AbCorrect clusters the sequences within each
UAID bin and builds a consensus/centroid sequence for each subcluster that passes the
--min
size threshold. To disable this, you can pass the --largest-cluster-only
option
and AbCorrect will only build a consensus/centroid sequence for the largest cluster within
each UAID bin.
To perform error-correction using identity clustering instead of UAIDs, you can:
$ abcorrect -j /path/to/MyData.json -t /path/to/temp/ -o /path/to/output/ --no-uaids
This will cluster the sequences at an identity threshold (default is 0.975, or 97.5% identity) and build a consensus sequemce for each cluster. To cluster with a threshold of 0.96 instead:
$ abcorrect -j /path/to/MyData.json -t /path/to/temp/ -o /path/to/output/ -I 0.96 --no-uaids
If you have more than one JSON file to be processed, you can pass AbCorrect a directory that contains one or more JSON files and each JSON file will be iteratively processed:
$ abcorrect -j /path/to/JSONs/ -t /path/to/temp/ -o /path/to/output/
All of the other options (such as the minimum number of sequences for consensus/centroid calculation) remain, although there is currently no way to specifiy different options for each JSON file.
If your AbStar-annotated sequences have already been uploaded to MongoDB, you can still
use AbCorrect to perform error correction. Rather than passing JSON files with -j
, you can
pass a MongoDB database name with -d
and a collection name with -c
:
$ abcorrect -d MyDatabase -c MyCollection -t /path/to/temp/ -o /path/to/output/
If you supply just the database name (without a collection), AbCorrect will iteratively process all collections in the supplied database:
$ abcorrect -d MyDatabase -t /path/to/temp/ -o /path/to/output/
The above example is querying MyDatabase on your local instance of MongoDB. To do the same
thing on a remote MongoDB server, you can pass the IP address with -i
(assuming the
default port of 27017
:
$ abcorrect -d MyDatabase -i 123.45.67.89 -t /path/to/temp/ -o /path/to/output/
If your MongoDB server uses a port other than 27017
, you can provide it using the --port
option. And if your remote MongoDB server requires authentication, you can supply the username with
--user
and the password with --password
. If you don’t supply both --user
and
--password
, AbCorrect will attempt to connect to the MongoDB database without authentication.
Finally, to make non-redundant set of sequences, AbCorrect provides the --nr
option:
$ abcorrect -d MyDatabase -t /path/to/temp/ -o /path/to/output/ --nr
This uses sort | uniq
, which is much faster than clustering at 100% identity with CD-HIT.
Warning
Using --nr
is is not the same as clustering at 100% identity. Two sequences that are
different lengths but are otherwise identical will be collapsed when clustering with CD-HIT
but will not be collapsed when using sort | uniq
.
AbFinder¶
Overview¶
AbFinder provides methods to mine large datasets of antibody sequences to rapidly identify sequences with high identity to known antibody sequences.
Given a MongoDB database and collection, AbFinder computes identity between one or more ‘standard’ sequences and all seqeunces in the collection. Default output is an identity/divergence plot, a hexbin plot of germline identity (X-axis) and identity to the standard sequence (Y-axis). AbFinder also updates MongoDB with identity information so that standard identities can be used in subsequent queries.
Examples¶
To run, AbFinder needs a MongoDB database and collection, an output directory, and a FASTA-formatted file of standard sequences:
$ abfinder -d MyDatabase -c MyCollection -s standards.fasta -o /path/to/output/
Omitting the collection results in AbFinder iteratively processing each collection
in the database. By default, AbFinder assumes that the standard file contains
amino acid sequences. If you would like to compute nucleotide identity instead,
you can indicate your preference with the --nucleotide
option:
$ abfinder -d MyDatabase -s standards.fasta -o /path/to/output/ --nucleotide
AbFinder also assumes that the standard file contains heavy chain sequences, and only
heavy chain sequences from MongoDB will be used for comparison. To compare sequences
of a different chain (options are 'heavy'
, 'kappa'
, and 'lambda'
), use
the --chain
option:
$ abfinder -d MyDatabase -s standards.fasta -o /path/to/output/ --chain kappa
If you do not plan on using the identity scores for any sort of downstream analysis, you can save some time and skip the MongoDB updates and just make the identity/divergence figures:
$ abfinder -d MyDatabase -s standards.fasta -o /path/to/output/ --no-update
There are several other options, mainly related to formatting the identity/divergence figures. A complete list of all options can be obtained with:
$ abfinder --help
AbPhylogeny¶
Overview¶
AbPhylogeny generates figure-quality phylogenetic trees from antibody sequence data. Designed with the ability to color individual sequences by attribute, phylogenetic trees can be drawn that accurately represent data from longitugindal samplings, different sampling locations (peripheral blood, bone marrow, FNA, etc), or categorical genetic characteristics like isotype.
AbPhylogeny can take input on any of three levels:
- FASTA-formatted sequence files
- FASTA-formatted multiple sequence alignment
- Newick-formatted tree file
If given sequence files, AbPhylogeny will perform multiple sequence alignment with MUSCLE, build a tree file from the alignment with FastTree, and draw the tree figure. If given an alignment, AbPhylogeny will build the tree file and draw the figure. If given a tree file, AbPhylogeny will simply draw the figure. In each case, AbPhylogeny will save all intermediate files to the output directory, so intermediates can be used to speed up multiple iterations on the same figure. This is especially helpful when trying multiple variations (colors, fontsizes, etc) of the same figure.
Examples¶
API¶
API Examples¶
API Reference¶
Core Utilities¶
abtools.alignment
: Pairwise and Multiple Sequence Alignment¶
-
abtools.alignment.
mafft
(sequences=None, alignment_file=None, fasta=None, fmt='fasta', threads=-1, as_file=False, print_stdout=False, print_stderr=False)¶ Performs multiple sequence alignment with MAFFT.
MAFFT is a required dependency.
Parameters: - sequences (list) –
Sequences to be aligned.
sequences
can be one of four things:- a FASTA-formatted string
- a list of BioPython
SeqRecord
objects - a list of AbTools
Sequence
objects - a list of lists/tuples, of the format
[sequence_id, sequence]
- alignment_file (str) – Path for the output alignment file. If not supplied,
a name will be generated using
tempfile.NamedTemporaryFile()
. - fasta (str) – Path to a FASTA-formatted file of sequences. Used as an
alternative to
sequences
when suppling a FASTA file. - fmt (str) – Format of the alignment. Options are ‘fasta’ and ‘clustal’. Default is ‘fasta’.
- threads (int) – Number of threads for MAFFT to use. Default is
-1
, which results in MAFFT usingmultiprocessing.cpu_count()
threads. - as_file (bool) – If
True
, returns a path to the alignment file. IfFalse
, returns a BioPythonMultipleSeqAlignment
object (obtained by callingBio.AlignIO.read()
on the alignment file).
Returns: - Returns a BioPython
MultipleSeqAlignment
object, unlessas_file
isTrue
, in which case the path to the alignment file is returned.
- sequences (list) –
-
abtools.alignment.
muscle
(sequences=None, alignment_file=None, fasta=None, fmt='fasta', as_file=False, maxiters=None, diags=False, gap_open=None, gap_extend=None)¶ Performs multiple sequence alignment with MUSCLE.
MUSCLE is a required dependency.
Parameters: - sequences (list) –
Sequences to be aligned.
sequences
can be one of four things:- a FASTA-formatted string
- a list of BioPython
SeqRecord
objects - a list of AbTools
Sequence
objects - a list of lists/tuples, of the format
[sequence_id, sequence]
- alignment_file (str) – Path for the output alignment file. If not supplied,
a name will be generated using
tempfile.NamedTemporaryFile()
. - fasta (str) – Path to a FASTA-formatted file of sequences. Used as an
alternative to
sequences
when suppling a FASTA file. - fmt (str) – Format of the alignment. Options are ‘fasta’ and ‘clustal’. Default is ‘fasta’.
- threads (int) – Number of threads for MAFFT to use. Default is
-1
, which results in MAFFT usingmultiprocessing.cpu_count()
threads. - as_file (bool) – If
True
, returns a path to the alignment file. IfFalse
, returns a BioPythonMultipleSeqAlignment
object (obtained by callingBio.AlignIO.read()
on the alignment file). - maxiters (int) – Passed directly to MUSCLE using the
-maxiters
flag. - diags (int) – Passed directly to MUSCLE using the
-diags
flag. - gap_open (float) – Passed directly to MUSCLE using the
-gapopen
flag. Ignored ifgap_extend
is not also provided. - gap_extend (float) – Passed directly to MUSCLE using the
-gapextend
flag. Ignored ifgap_open
is not also provided.
Returns: - Returns a BioPython
MultipleSeqAlignment
object, unlessas_file
isTrue
, in which case the path to the alignment file is returned.
- sequences (list) –
-
abtools.alignment.
local_alignment
(query, target=None, targets=None, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, matrix=None, aa=False, gap_open_penalty=None, gap_extend_penalty=None)¶ Striped Smith-Waterman local pairwise alignment.
Parameters: - query –
Query sequence.
query
can be one of four things:- a nucleotide or amino acid sequence, as a string
- a Biopython
SeqRecord
object - an AbTools
Sequence
object - a list/tuple of the format
[seq_id, sequence]
- target – A single target sequence.
target
can be anything thatquery
accepts. - targets (list) – A list of target sequences, to be proccssed iteratively.
Each element in the
targets
list can be anything accepted byquery
. - match (int) – Match score. Should be a positive integer. Default is 3.
- mismatch (int) – Mismatch score. Should be a negative integer. Default is -2.
- gap_open (int) – Penalty for opening gaps. Should be a negative integer. Default is -5.
- gap_extend (int) – Penalty for extending gaps. Should be a negative integer. Default is -2.
- matrix (str, dict) –
Alignment scoring matrix. Two options for passing the alignment matrix:
- The name of a built-in matrix. Current options are
blosum62
andpam250
. - A nested dictionary, giving an alignment score for each residue pair. Should be formatted
such that retrieving the alignment score for A and G is accomplished by:
matrix['A']['G']
- The name of a built-in matrix. Current options are
- aa (bool) – Must be set to
True
if aligning amino acid sequences. Default isFalse
.
Returns: If a single target sequence is provided (via
target
), a singleSSWAlignment
object will be returned. If multiple target sequences are supplied (viatargets
), a list ofSSWAlignment
objects will be returned.- query –
-
abtools.alignment.
global_alignment
(query, target=None, targets=None, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, score_match=None, score_mismatch=None, score_gap_open=None, score_gap_extend=None, matrix=None, aa=False)¶ Needleman-Wunch global pairwise alignment.
With
global_alignment
, you can score an alignment using different paramaters than were used to compute the alignment. This allows you to compute pure identity scores (match=1, mismatch=0) on pairs of sequences for which those alignment parameters would be unsuitable. For example:seq1 = 'ATGCAGC' seq2 = 'ATCAAGC'
using identity scoring params (match=1, all penalties are 0) for both alignment and scoring produces the following alignment:
ATGCA-GC || || || AT-CAAGC
with an alignment score of 6 and an alignment length of 8 (identity = 75%). But what if we want to calculate the identity of a gapless alignment? Using:
global_alignment(seq1, seq2, gap_open=20, score_match=1, score_mismatch=0, score_gap_open=10, score_gap_extend=1)
we get the following alignment:
ATGCAGC || ||| ATCAAGC
which has an score of 5 and an alignment length of 7 (identity = 71%). Obviously, this is an overly simple example (it would be much easier to force gapless alignment by just iterating over each sequence and counting the matches), but there are several real-life cases in which different alignment and scoring paramaters are desirable.
Parameters: - query –
Query sequence.
query
can be one of four things:- a nucleotide or amino acid sequence, as a string
- a Biopython
SeqRecord
object - an AbTools
Sequence
object - a list/tuple of the format
[seq_id, sequence]
- target – A single target sequence.
target
can be anything thatquery
accepts. - targets (list) – A list of target sequences, to be proccssed iteratively.
Each element in the
targets
list can be anything accepted byquery
. - match (int) – Match score for alignment. Should be a positive integer. Default is 3.
- mismatch (int) – Mismatch score for alignment. Should be a negative integer. Default is -2.
- gap_open (int) – Penalty for opening gaps in alignment. Should be a negative integer. Default is -5.
- gap_extend (int) – Penalty for extending gaps in alignment. Should be a negative integer. Default is -2.
- score_match (int) – Match score for scoring the alignment. Should be a positive integer.
Default is to use the score from
match
ormatrix
, whichever is provided. - score_mismatch (int) – Mismatch score for scoring the alignment. Should be a negative
integer. Default is to use the score from
mismatch
ormatrix
, whichever is provided. - score_gap_open (int) – Gap open penalty for scoring the alignment. Should be a negative
integer. Default is to use
gap_open
. - score_gap_extend (int) – Gap extend penalty for scoring the alignment. Should be a negative
integer. Default is to use
gap_extend
. - matrix (str, dict) –
Alignment scoring matrix. Two options for passing the alignment matrix:
- The name of a built-in matrix. Current options are
blosum62
andpam250
. - A nested dictionary, giving an alignment score for each residue pair. Should be
formatted such that retrieving the alignment score for A and G is accomplished by:
matrix['A']['G']
- The name of a built-in matrix. Current options are
- aa (bool) – Must be set to
True
if aligning amino acid sequences. Default isFalse
.
Returns: If a single target sequence is provided (via
target
), a singleNWAlignment
object will be returned. If multiple target sequences are supplied (viatargets
), a list ofNWAlignment
objects will be returned.- query –
-
class
abtools.alignment.
BaseAlignment
(query, target, matrix, match, mismatch, gap_open, gap_extend, aa)¶ Base class for local and global pairwise alignments.
Note
All comparisons between
BaseAlignments
are done on thescore
attribute (which must be implemented by any classes that subclassBaseAlignment
). This was done so that sorting alignments like so:alignments = [list of alignments] alignments.sort(reverse=True)
results in a sorted list of alignments from the highest alignment score to the lowest.
-
query
¶ Sequence – The input query sequence, as an AbTools
Sequence
object.
-
target
¶ Sequence – The input target sequence, as an AbTools
Sequence
object.
-
target_id
¶ str – ID of the target sequence.
-
raw_query
¶ The raw query, before conversion to a
Sequence
.
-
raw_target
¶ The raw target, before conversion to a
Sequence
.
-
-
class
abtools.alignment.
SSWAlignment
(query, target, match=3, mismatch=-2, matrix=None, gap_open=5, gap_extend=2, aa=False)¶ Structure for performing and analyzing a Smith-Waterman local alignment.
-
alignment_type
¶ str – Is ‘local’ for all
SSWAlignment
objects.
-
aligned_query
¶ str – The aligned query sequence (including gaps).
-
aligned_target
¶ str – The aligned target sequence (including gaps).
-
alignment_midline
¶ str – Midline for the aligned sequences, with
|
indicating matches and a gap indicating mismatches:print(aln.aligned_query) print(aln.alignment_midline) print(aln.aligned_target) # ATGC # || | # ATCC
-
score
¶ int – Alignment score.
-
query_begin
¶ int – Position in the raw query sequence at which the optimal alignment begins.
-
query_end
¶ int – Position in the raw query sequence at which the optimal alignment ends.
-
target_begin
¶ int – Position in the raw target sequence at which the optimal alignment begins.
-
target_end
¶ int – Position in the raw target sequence at which the optimal alignment ends.
-
-
class
abtools.alignment.
NWAlignment
(query, target, match=3, mismatch=-2, gap_open=-5, gap_extend=-2, score_match=None, score_mismatch=None, score_gap_open=None, score_gap_extend=None, matrix=None, aa=False)¶ Structure for performing and analyzing a Needleman-Wunch global alignment.
-
alignment_type
¶ str – Is ‘global’ for all
NWAlignment
objects.
-
aligned_query
¶ str – The aligned query sequence (including gaps).
-
aligned_target
¶ str – The aligned target sequence (including gaps).
-
alignment_midline
¶ str – Midline for the aligned sequences, with
|
indicating matches and a gap indicating mismatches:print(aln.aligned_query) print(aln.alignment_midline) print(aln.aligned_target) # ATGC # || | # ATCC
-
score
¶ int – Alignment score.
-
query_begin
¶ int – Position in the raw query sequence at which the optimal alignment begins.
-
query_end
¶ int – Position in the raw query sequence at which the optimal alignment ends.
-
target_begin
¶ int – Position in the raw target sequence at which the optimal alignment begins.
-
target_end
¶ int – Position in the raw target sequence at which the optimal alignment ends.
-
abtools.cluster
: Sequence Clustering¶
-
class
abtools.cluster.
Cluster
(raw_cluster, seq_db=None, db_path=None, seq_dict=None)¶ Data and methods for a cluster of sequences.
All public attributes are evaluated lazily, so attributes that require significant processing time are only computed when needed. In addition, attributes are only calculated once, so if you change the Cluster object after accessing attributes, the attributes will not update. Setters are provided for all attributes, however, so you can update them manually if necessary:
seqs = [Sequence1, Sequence2, ... SequenceN] clust = cluster(seqs) # calculate the consensus consensus = clust.consensus # add sequences to the Cluster more_sequences = [SequenceA, SequenceB, SequenceC] clust.sequences += more_sequences # need to recompute the consensus manually clust.consensus = clust._make_consensus()
-
ids
¶ list – A list of all sequence IDs in the Cluster
-
size
¶ int – Number of sequences in the Cluster
-
sequences
¶ list – A list of all sequences in the Cluster, as AbTools
Sequence
objects.
-
consensus
¶ Sequence – Consensus sequence, calculated by aligning all sequences with MAFFT and computing the
Bio.Align.AlignInfo.SummaryInfo.gap_consensus()
-
centroid
¶ Sequence – Centroid sequence, as calculated by CD-HIT.
-
-
abtools.cluster.
cluster
(seqs, threshold=0.975, out_file=None, make_db=True, temp_dir=None, quiet=False, threads=0, return_just_seq_ids=False, max_memory=800, debug=False)¶ Perform sequence clustering with CD-HIT.
Parameters: - seqs (list) – An iterable of sequences, in any format that abtools.sequence.Sequence() can handle
- threshold (float) – Clustering identity threshold. Default is 0.975.
- out_file (str) – Path to the clustering output file. Default is to use tempfile.NamedTempraryFile to generate an output file name.
- temp_dir (str) – Path to the temporary directory. If not provided, ‘/tmp’ is used.
- make_db (bool) – Whether to build a SQlite database of sequence information. Required if you want to calculate consensus/centroid sequences for the resulting clusters or if you need to access the clustered sequences (not just sequence IDs) Default is True.
Returns: A list of Cluster objects, one per cluster.
Return type: list
abtools.log
: Logging¶
-
abtools.log.
setup_logging
(logfile, print_log_location=True, debug=False)¶ Set up logging using the built-in
logging
package.A stream handler is added to all logs, so that logs at or above
logging.INFO
level are printed to screen as well as written to the log file.Parameters: - logfile (str) – Path to the log file. If the parent directory does not exist, it will be created. Required.
- print_log_location (bool) – If
True
, the log path will be written to the log upon initialization. Default isTrue
. - debug (bool) – If true, the log level will be set to
logging.DEBUG
. IfFalse
, the log level will be set tologging.INFO
. Default isFalse
.
-
abtools.log.
get_logger
(name=None)¶ Get a logging handle.
As with
setup_logging
, a stream handler is added to the log handle.Parameters: name (str) – Name of the log handle. Default is None
.
abtools.pipeline
: Utilities for building pipelines of AbTools functions¶
-
abtools.pipeline.
initialize
(log_file, project_dir=None, debug=False)¶ Initializes an AbTools pipeline.
Initialization includes printing the AbTools splash, setting up logging, creating the project directory, and logging both the project directory and the log location.
Parameters: - log_file (str) – Path to the log file. Required.
- project_dir (str) – Path to the project directory. If not provided, the project directory won’t be created and the location won’t be logged.
- debug (bool) – If
True
, the logging level will be set tologging.DEBUG
. Default isFALSE
, which logs atlogging.INFO
.
Returns: logger
-
abtools.pipeline.
make_dir
(d)¶ Makes a directory, if it doesn’t already exist.
Parameters: d (str) – Path to a directory.
-
abtools.pipeline.
list_files
(d, extension=None)¶ Lists files in a given directory.
Parameters: - d (str) – Path to a directory.
- extension (str) – If supplied, only files that contain the
specificied extension will be returned. Default is
False
, which returns all files ind
.
Returns: A sorted list of file paths.
Return type: list
abtools.s3
: Backup data to S3¶
-
abtools.s3.
compress_and_upload
(data, compressed_file, s3_path, multipart_chunk_size_mb=500, method='gz', delete=False, access_key=None, secret_key=None)¶ Compresses data and uploads to S3.
S3 upload uses
s3cmd
, so you must either:- Manually configure
s3cmd
prior to use (typically usings3cmd --configure
). - Configure
s3cmd
usings3.configure()
. - Pass your access key and secret key to
compress_and_upload
, which will automatically configure s3cmd.
Parameters: - data –
Can be one of three things:
- Path to a single file
- Path to a directory
- A list of one or more paths to files or directories
- compressed_file (str) – Path to the compressed file. Required.
- s3_path (str) –
The S3 path, with the filename omitted. The S3 filename will be the basename of the
compressed_file
. For example:compress_and_upload(data='/path/to/data', compressed_file='/path/to/compressed.tar.gz', s3_path='s3://my_bucket/path/to/')
will result in an uploaded S3 path of
s3://my_bucket/path/to/compressed.tar.gz
- method (str) – Compression method. Options are
'gz'
(gzip) or'bz2'
(bzip2). Default is'gz'
. - delete (bool) – If
True
, thecompressed_file
will be deleted after upload to S3. Default isFalse
. - access_key (str) – AWS access key.
- secret_key (str) – AWS secret key.
- Manually configure
-
abtools.s3.
put
(f, s3_path, multipart_chunk_size_mb=500, logger=None)¶ Uploads a single file to S3, using s3cmd.
Parameters: - f (str) – Path to a single file.
- s3_path (str) –
The S3 path, with the filename omitted. The S3 filename will be the basename of the
f
. For example:put(f='/path/to/myfile.tar.gz', s3_path='s3://my_bucket/path/to/')
will result in an uploaded S3 path of
s3://my_bucket/path/to/myfile.tar.gz
-
abtools.s3.
compress
(d, output, compress='gz', logger=None)¶ Creates a compressed/uncompressed tar file.
Parameters: - d –
Can be one of three things:
- the path to a single file, as a string
- the path to a single directory, as a string
- an iterable of file or directory paths
- output (str) – Output file path.
- compress – Compression method. Options are
'gz'
(gzip),'bz2'
(bzip2) and'none'
(uncompressed). Default is'gz'
.
- d –
-
abtools.s3.
configure
(access_key=None, secret_key=None, logger=None)¶ Configures s3cmd prior to first use.
If no arguments are provided, you will be prompted to enter the access key and secret key interactively.
Parameters: - access_key (str) – AWS access key
- secret_key (str) – AWS secret key
abtools.sequence
: Sequence utilities¶
-
class
abtools.sequence.
Sequence
(seq, id=None, qual=None, id_key='seq_id', seq_key='vdj_nt')¶ Container for biological (RNA and DNA) sequences.
seq
can be one of several things:- a raw sequence, as a string
- an iterable, formatted as
[seq_id, sequence]
- a dict, containing at least the ID (default key = ‘seq_id’) and a
sequence (default key = ‘vdj_nt’). Alternate
id_key
andseq_key
can be provided at instantiation. - a Biopython
SeqRecord
object - an AbTools
Sequence
object
If
seq
is provided as a string, the sequence ID can optionally be provided viaid
. Ifseq
is a string andid
is not provided, a random sequence ID will be generated withuuid.uuid4()
.Quality scores can be supplied with
qual
or as part of aSeqRecord
object. If providing both a SeqRecord object with quality scores and quality scores viaqual
, thequal
scores will override the SeqRecord quality scores.If
seq
is a dictionary, typically the result of a MongoDB query, the dictionary can be accessed directly from theSequence
instance. To retrive the value for'junc_aa'
in the instantiating dictionary, you would simply:s = Sequence(dict) junc = s['junc_aa']
If
seq
is a dictionary, an optionalid_key
andseq_key
can be provided, which tells theSequence
object which field to use to populateSequence.id
andSequence.sequence
. Defaults areid_key='seq_id'
andseq_key='vdj_nt'
.Alternately, the
__getitem__()
interface can be used to obtain a slice from thesequence
attribute. An example of the distinction:d = {'name': 'MySequence', 'sequence': 'ATGC'} seq = Sequence(d, id_key='name', seq_key='sequence') seq['name'] # 'MySequence' seq[:2] # 'AT'
If the
Sequence
is instantiated with a dictionary, calls to__contains__()
will returnTrue
if the supplied item is a key in the dictionary. In non-dict instantiations,__contains__()
will look in theSequence.sequence
field directly (essentially a motif search). For example:dict_seq = Sequence({'seq_id': 'seq1', 'vdj_nt': 'ACGT'}) 'seq_id' in dict_seq # TRUE 'ACG' in dict_seq # FALSE str_seq = Sequence('ACGT', id='seq1') 'seq_id' in str_seq # FALSE 'ACG' in str_seq # TRUE
Note
When comparing
Sequence
objects, they are comsidered equal only if their sequences and IDs are identical. This means that twoSequence
objects with identical sequences but without user-supplied IDs won’t be equal, because their IDs will have been randomly generated.-
fasta
¶ str – Returns the sequence, as a FASTA-formatted string
Note: The FASTA string is built using
Sequence.id
andSequence.sequence
.
-
fastq
¶ str – Returns the sequence, as a FASTQ-formatted string
If
Sequence.qual
isNone
, thenNone
will be returned instead of a FASTQ string
-
reverse_complement
¶ str – Returns the reverse complement of
Sequence.sequence
.
-
region
(start=0, end=None)¶ Returns a region of
Sequence.sequence
, in FASTA format.If called without kwargs, the entire sequence will be returned.
Parameters: - start (int) – Start position of the region to be returned. Default is 0.
- end (int) – End position of the region to be returned. Negative values will function as they do when slicing strings.
Returns: A region of
Sequence.sequence
, in FASTA formatReturn type: str
Secondary Annotation¶
abtools.compare
: Repertoire-level comparison¶
-
abtools._compare.
aggregate
(data)¶ Counts the number of occurances of each item in ‘data’.
Input data: a list of values.
Output a dict of bins and counts.
-
abtools._compare.
mh_similarity
(sample1, sample2)¶ Calculates the Marista-Horn similarity for two samples.
Parameters: - sample1 – list of frequencies for sample 1
- sample2 – list of frequencies for sample 2
Returns: Marista-Horn similarity (between 0 and 1)
Return type: float
-
abtools._compare.
kl_divergence
(s1, s2)¶ Calculates the Kullback-Leibler divergence for two samples.
Parameters: - sample1 – probability distribution for sample 1
- sample2 – probability distribution for sample 2
Returns: Kullbeck-Leibler similarity
Return type: float
-
abtools._compare.
js_similarity
(s1, s2)¶ Calculates the Jensen-Shannon similarity for two samples.
Parameters: - sample1 – probability distribution for sample 1
- sample2 – probability distribution for sample 2
Returns: Jensen-Shannon similarity (between 0 and 1)
Return type: float
-
abtools._compare.
shannon_entropy
(prob_dist)¶ Calculates the Shannon entropy for a single probability distribution.
Parameters: prob_dist – probability distribution, must sum to 1 Returns: Shannon entropy Return type: float
-
abtools._compare.
jaccard_similarity
(s1, s2)¶ Calculates the Jaccard similarity for two samples.
Parameters: - sample1 – list of frequencies for sample 1
- sample2 – list of frequencies for sample 2
Returns: Jaccard similarity (between 0 and 1)
Return type: float
-
abtools._compare.
renkonen_similarity
(s1, s2)¶ Calculates the Renkonen similarity (also known as the percentage similarity) for two samples.
Parameters: - s1 – probability distribution for sample 1
- s2 – probability distribution for sample 2
Returns: Renkonen similarity (between 0 and 1)
Return type: float
-
abtools._compare.
bc_similarity
(s1, s2)¶ Calculates the Bray-Curtis similarity for two samples.
Parameters: - s1 – probability distribution for sample 1
- s2 – probability distribution for sample 2
Returns: Bray-Curtis similarity (between 0 and 1)
Return type: float
-
abtools._compare.
cosine_similarity
(s1, s2)¶ Calculates the cosine (angular) similarity for two samples.
Parameters: - s1 – list of frequencies for sample 1
- s2 – list of frequencies for sample 2
Returns: Cosine similarity (between 0 and 1)
Return type: float
-
abtools._compare.
sd_similarity
(s1, s2)¶ Calculates the Brey-Curtis similarity for two samples.
Parameters: - s1 – list of frequencies for sample 1
- s2 – list of frequencies for sample 2
Results:
float: Brey-Curtis similarity (between 0 and 1)
-
abtools._compare.
run
(**kwargs)¶ Performs repertoire-level comparison of antibody sequencing datasets.
Currently, the only metric for comparison is V-gene usage frequency. Additional measures are in the works (such as comparisons based on clonality).
Parameters: - db (str) – MongoDB database name.
- collection1 (str) – Name of the first MongoDB collection to query for comparison.
If both
collection1
andcollection2
are provided,collection1
will be compared only tocollection2
. If neithercollection1
norcollection2
are provided, all collections indb
will be processed iteratively (all pairwise comparisons will be made). Ifcollection1
is provided butcollection2
is not,collection1
will be iteratively compared to all other collections indb
. - collection2 (str) – Name of the second MongoDB collection to query for comparison.
If both
collection1
andcollection2
are provided,collection1
will be compared only tocollection2
. If neithercollection1
norcollection2
are provided, all collections indb
will be processed iteratively (all pairwise comparisons will be made). - collection_prefix (str) – All collections beginning with
collection_prefix
will be iteratively compared (all pairwise comparisons will be made). - ip (str) – IP address of the MongoDB server. Default is
localhost
. - port (int) – Port of the MongoDB server. Default is
27017
. - user (str) – Username with which to connect to the MongoDB database. If either
of
user
orpassword
is not provided, the connection to the MongoDB database will be attempted without authentication. - password (str) – Password with which to connect to the MongoDB database. If either
of
user
orpassword
is not provided, the connection to the MongoDB database will be attempted without authentication. - chunksize (int) – Number of sequences for each iteration. Default is 100,000.
- iterations (int) – Number of iterations to perform on each pair of samples. Default is 10,000
- method (str) –
Similarity/divergence method to used for comparison. Default is
marisita-horn
. Options are:marisita-horn
kullback-leibler
jensen-shannon
jaccard
bray-curtis
renkonen
cosine
- control_similarity (bool) – If
True
, control similarity/divergence will be calculated, in which each sample is also compared to itself. Default isFalse
. - chain (str) – Antibody chain to be used for comparison. Options are
heavy
,kappa
andlambda
. Default isheavy
.
abtools.correct
: PCR and sequencing error correction¶
abtools.finder
: Mine NGS datasets for similarity to known mAbs¶
-
abtools._finder.
chunker
(l, n)¶ Generator that produces n-length chunks from iterable l.
-
abtools._finder.
run
(**kwargs)¶ Mines NGS datasets for identity to known antibody sequences.
All of
db
,output
,temp
andstandard
are required.Parameters: - db (str) – Name of a MongoDB database to query.
- collection (str) – Name of a MongoDB collection. If not provided, all collections
in
db
will be processed iteratively. - output_dir (str) – Path to the output directory, into which identity/divergence figures will be deposited.
- temp_dir (str) – Path to a temporary directory.
- log (str) – Path to a log file. If not provided, log information will not be retained.
- ip (str) – IP address of the MongoDB server. Default is
localhost
. - port (str) – Port of the MongoDB server. Default is
27017
. - user (str) – Username with which to connect to the MongoDB database. If either
of
user
orpassword
is not provided, the connection to the MongoDB database will be attempted without authentication. - password (str) – Password with which to connect to the MongoDB database. If either
of
user
orpassword
is not provided, the connection to the MongoDB database will be attempted without authentication. - standard (path) – Path to a FASTA-formatted file containing one or more ‘standard’ sequences, against which the NGS sequences will be compared.
- chain (str) – Antibody chain. Choices are ‘heavy’, ‘kappa’, ‘lambda’, and ‘light’.
Default is ‘heavy’. Only NGS sequences matching
chain
(with ‘light’ covering both ‘kappa’ and ‘lambda’) will be compared to thestandard
sequences. - update (bool) – If
True
, the MongoDB record for each NGS sequence will be updated with identity information for each standard. IfFalse
, the updated is skipped. Default isTrue
. - is_aa (bool) – If
True
, thestandard
sequences are amino acid sequences. IfFalse
, they are nucleotide seqeunces. Default isFalse
. - x_min (int) – Minimum x-axis value on identity/divergence plots.
- x_max (int) – Maximum x-axis value on identity/divergence plots.
- y_min (int) – Minimum y-axis value on identity/divergence plots.
- y_max (int) – Maximum y-axis value on identity/divergence plots.
- gridsize (int) – Relative size of hexbin grids.
- mincount (int) – Minimum number of sequences in a hexbin for the bin to be colored. Default is 3.
- colormap (str, colormap) – Colormap to be used for identity/divergence plots.
Default is
Blues
. - debug (bool) – If
True
, more verbose logging.
abtools.phylogeny
: Phylogenetic analysis of antibody lineages¶
-
abtools._phylogeny.
run
(**kwargs)¶ Builds a phylogenetic representation of antibody sequences.
output
is required, as well as one ofinput
,alignment
ornewick
.Parameters: - input (str) –
Can be one of three things:
- Path to a FASTA-formatted file containing input sequences.
- A list of AbTools
Sequence
objects. - A list of dictionaries, containing at minimum
name_key
andseq_key
.
- output (str) – Path to the output directory, into which tree images and all intermediate files will be deposited.
- root (str) – Path to a FASTA-formatted file containing a single sequence which will be used to root the tree. If not provided, tree will be unrooted.
- mabs (str) – Path to a FASTA-formatted file containing mAb sequences. If supplying both mAb sequences and NGS sequences, passing the mAb sequences separately allows you to modify their representation separately (for example, show sequence IDs for just the mAb sequences).
- alignment (str) – Path to a multiple sequence alignment, in FASTA format. If sequences are already aligned, this will save some computational time since the alignment will not be redone.
- newick (str) – Path to a tree file, in Newick format. As with
alignment
, this is primarily to save computational time if the tree file has already been generated. - name_key (str) – If
input
is a list of Sequence objects or dicts, this key will be used to find the sequence ID. Default isseq_id
. - sequence_key (str) – If
input
is a list of Sequence objects or dicts, this key will be used to find the sequence. Default isvdj_nt
. - timepoints (str) –
Path to a Tab-delimited file, of the following format (one per line):
TimepointName TimepointOrder TimepointColor
TimepointName
should prepended to the sequences in the input file (separated bydelimiter
).TimepointOrder
is an integer that indicates the order in which the timepoints should be sorted.TimepointColor
is a hex value that will be used to color the phylogenetic tree. If mAb sequences are provided, the ‘mab’TimepointName
will be used to sort/color the mAb sequences. If not provided, colors will be automatically selected and timepoints will be determined by a simple sort of the raw timepoint values parsed from the input file. - is_aa (bool) – If
True
, input sequences will be assumed to be amino acid sequences. Default isFalse
, which assumes nucleotide sequences. - delimiter (str) – The delimiter used in sequence IDs to separate the timepoint from
the sequence name. Default is
_
. - scale (int) – Horizontal scale of the phylogeny. Default is
None
, which uses the defaultete2
value. - branch_vertical_margin (float) – Vertical scale of the phylogeny. Default is
None
, which uses the defaultete2
value. - label_nodes (str) – Type of nodes to be labeled. Options are:
all
,none
,no-root
,mab
,input
, androot
. - label_fontsize (float) – Font size for the node labels.
- tree_orientation (int) – If
0
, tree is drawn from left to right. If1
, tree will be drawn from right to left (mirror). Default is0
.
- input (str) –
About¶
License¶
The MIT License (MIT)
Copyright (c) 2016 Bryan Briney
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.