Welcome to parabam¶
Author: | Henry Farmery |
---|---|
Date: | August 16, 2017 |
Version: | 0.1 |
parabam is a python package that allows users to quickly and easily conduct analysis on BAM files. parabam is designed to allow the user to make maximum useage of the processing power available to their machines.
Running an analysis on a BAM file can be as easy as specifiying a single function and then typing a command into on your terminal. More advanced users may incorporate parabam into their own code or expand parabam by making using of OOP inheritence.
Contents:
Preface - Quickstart¶
For those who don’t want to read the docs before jumping in, try the code snippets below to get going with parabam as quickly as possible.
First open your terminal and install parabam:
sudo pip install parabam
Then in your terminal create a directory to run your parabam analysis on:
cd ~
mkdir parabam-analysis
cd parabam-analysis
Now let’s create an instruction file for our analysis. Create the file:
touch instruction.py
Open the instruction.py file in your favourite text editor and paste this code into your new file:
def rule(read,constants,master):
if "ATCGATCG" in read.seq:
return True
return False
Now locate the BAM file you wish to run parabam analysis on. Then type the following into your terminal command line:
parabam subset -v2 -i instructions -b /path/to/bam/file_name.bam
After a while (depending on how large your input BAM file was) parabam will be finished and you will be left with a result in <file_name>_subset.bam.
Congratulations - you just ran parabam subset. Read on to find out more.
Introduction¶
So, you’ve got this far through the documentation which means you are at least reasonably curious about parabam. Great!
The goal of parabam is to allow the user to inspect large BAM files in a timely manner whilst making the most of their computational resources and without having to write too much code. The most direct way to use parabam is via the command line, but it can also be invoked programaticaly via interface classes, and also support full incorporation via OOP inheritance.
If any of the words in the pragraph above don’t make sense: don’t worry. Parabam is easy to run for beginniners. All you need is to know a little bit of Unix (how to make and change directories) and a little bit of Python. Just follow the instructions and you’ll be conducting high-throughput analysis in no time. If you have any questions, feel free to email me.
Alternatively, if you understood every word of the above and want to get more in-depth with parabam then pay close attention to the programmatic and inheritance based interfaces.
Installing parabam¶
First, we have to get parabam running on your computer. This section will guide you through installing parabam.
Currently, parabam is only available to POSIX compliant systems (Linux, Mac OSX, etc) and not Windows.
The easy way¶
The easiest way to install parabam is as follows:
sudo pip install parabam
This will also place an executable in a location on your path. So after typing this command you should be able to do the following:
>>> parabam
>>> parabam
----------------------------------------------------------------
About:
Parabam - analyse bam file in parallel
Usage:
parabam <command> [options] instruction:{Python} input:{BAM} output:{BAM/CSV}
Command:
stat Genereate stats regarding the BAM file
subset Create a subsetted BAM file
If typing parbam into your terminal yields the result above, you have succesfully installed parabam.
The manual way¶
Some users may wish to install from source. The source may be downloaded from here. This part of the documentation isn’t written yet, but if you want to do this you probably know how to do it anyway.
The precompiled way¶
If you really can’t be bothered, then perhap you could try some of the precompiled binaries listed here.
Simply copy these to a location on your path and type:
It should work out of the box if you picked the right binary.
Running parabam on a cluster¶
Some users may wish to run parabam on a high powered computing service like a cluster. On systems such as these you may not have root access, so you may not be able to install to the Python system directories. In these cases you may wish to use parabam within a virtualenv.
Fundamentals of parabam¶
Here are some fundamental concepts relating to parabam that will make interacting with parabam easier.
The subset and stat operations¶
Currently parabam allows you to conduct two separate “operations”. These are subset and stat.
The subset operation allows the user to create BAMs containing a specific subset of reads from within a given BAM file. For example, a subset BAM containing all of the unmapped reads in the file, or all the reads which contain a certain sequence.
The stat operation allows the user to collect data concerning the reads in a given BAM file. These data are collected and output in a .csv file at the end of operation. Data can be collected as an int, float or NumPy array. For instance, the user may wish to collect the amount of reads in a file that contain a specific sequence, or statistics like the average insert size of a BAM file.
Ways of interacting with parabam¶
There are several ways to interact with parabam. The most direct way is to write a short instruction script in Python and then run this on the command line. For details on how to write instructions files for the various parabam operations consult the subset and stat sections.
Users may also incorporate parabam into their own programs. This is as simple as importing the relevant Interface class for your intended operation (subset or stat) and then providing the relevant arguments.
Examples for both of these methods may be found on the subset examples and stat examples pages.
Engines and constants¶
parabam works by applying a rule (a user defined function) to every read in the BAM file. Engines are simple functions that only take three arguments. These are:
- read - A sequencing read. This is an object of type pysam.AlignedSegment
- constants - A set of user defined constants. This is a simple python dict.
- master - An object which represents the sample BAM file from which the read originated. This is an object of type pysam.AlignmentFile
For examples which show in detail how to create rules for both subset and stat operations, see the subset and stat sections.
Output Files¶
Certain invocations of parabam will create many output files, particularly runs using multiple statistics.
Pysam classes: read and master¶
parabam uses the package pysam to process BAM files. As such the objects we interact with are pysam classes.
The most prevalent examples of pysam objects are the read and master arguments when defining a rule. The read is of type AlignedSegment and master is of type AlignmentFile. Both of these are pysam classes. Their documentation is reproduced here for reference. Pysam is entirely the work of Andreas Heger and contributors.
Please Note: Users of parabam don’t need to know a great deal about pysam classes (most of this is taken care of behind the scenes), but when writing rules it will help to know the different variables and functions that you may reference with the read and master arguments.
read (AlignedSegment)¶
Any of the following functions or variables may be referenced when dealing with the read
- class pysam.AlignedSegment¶
AlignedSegment() Class representing an aligned segment.
This class stores a handle to the samtools C-structure representing an aligned read. Member read access is forwarded to the C-structure and converted into python objects. This implementation should be fast, as only the data needed is converted.
For write access, the C-structure is updated in-place. This is not the most efficient way to build BAM entries, as the variable length data is concatenated and thus needs to be resized if a field is updated. Furthermore, the BAM entry might be in an inconsistent state.
One issue to look out for is that the sequence should always be set before the quality scores. Setting the sequence will also erase any quality scores that were set previously.
- __hash__¶
x.__hash__() <==> hash(x)
- __str__¶
return string representation of alignment.
The representation is an approximate sam format.
An aligned read might not be associated with a AlignmentFile. As a result tid is shown instead of the reference name.
Similarly, the tags field is returned in its parsed state.
- bin¶
properties bin
- cigarstring¶
the cigar alignment as a string.
The cigar string is a string of alternating integers and characters denoting the length and the type of an operation.
Note
The order length,operation is specified in the SAM format. It is different from the order of the cigar property.
Returns None if not present.
To unset the cigarstring, assign None or the empty string.
- cigartuples¶
the cigar alignment. The alignment is returned as a list of tuples of (operation, length).
If the alignment is not present, None is returned.
The operations are:
M BAM_CMATCH 0 I BAM_CINS 1 D BAM_CDEL 2 N BAM_CREF_SKIP 3 S BAM_CSOFT_CLIP 4 H BAM_CHARD_CLIP 5 P BAM_CPAD 6 = BAM_CEQUAL 7 X BAM_CDIFF 8 Note
The output is a list of (operation, length) tuples, such as [(0, 30)]. This is different from the SAM specification and the cigarstring property, which uses a (length, operation) order, for example: 30M.
To unset the cigar property, assign an empty list or None.
- compare(self, AlignedSegment other)¶
return -1,0,1, if contents in this are binary <,=,> to other
- flag¶
properties flag
- get_aligned_pairs(self)¶
a list of aligned read and reference positions.
- get_blocks(self)¶
a list of start and end positions of aligned gapless blocks.
The start and end positions are in genomic coordinates.
Blocks are not normalized, i.e. two blocks might be directly adjacent. This happens if the two blocks are separated by an insertion in the read.
- get_overlap(self, uint32_t start, uint32_t end)¶
return number of aligned bases of read overlapping the interval start and end on the reference sequence.
Return None if cigar alignment is not available.
- get_reference_positions(self, full_length=False)¶
a list of reference positions that this read aligns to.
By default, this method only returns positions in the reference that are within the alignment. If full_length is set, None values will be included for any soft-clipped or unaligned positions within the read. The returned list will thus be of the same length as the read.
- infer_query_length(self, always=True)¶
inferred read length from CIGAR string.
If always is set to True, the read length will be always inferred. If set to False, the length of the read sequence will be returned if it is available.
Returns None if CIGAR string is not present.
- is_duplicate¶
true if optical or PCR duplicate
- is_paired¶
true if read is paired in sequencing
- is_proper_pair¶
true if read is mapped in a proper pair
- is_qcfail¶
true if QC failure
- is_read1¶
true if this is read1
- is_read2¶
true if this is read2
- is_reverse¶
true if read is mapped to reverse strand
- is_secondary¶
true if not primary alignment
- is_unmapped¶
true if read itself is unmapped
- mapping_quality¶
mapping quality
- mate_is_reverse¶
true is read is mapped to reverse strand
- mate_is_unmapped¶
true if the mate is unmapped
- next_reference_id¶
the reference id of the mate/next read.
- next_reference_start¶
the position of the mate/next read.
- opt(self, tag)¶
retrieves optional data given a two-letter tag
- overlap(self)¶
- query_aligment_length¶
length of the aligned query sequence.
This is equal to qend - qstart
- query_alignment_end¶
end index of the aligned query portion of the sequence (0-based, exclusive)
- query_alignment_length¶
length of the query template. This includes soft-clipped bases and is equal to len(seq).
This property is read-only.
Returns 0 if not available.
- query_alignment_qualities¶
aligned query sequence quality values (None if not present). These are the quality values that correspond to query, that is, they exclude qualities of soft clipped bases. This is equal to qual[qstart:qend].
Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.
This property is read-only.
- query_alignment_sequence¶
aligned portion of the read.
This is a substring of seq that excludes flanking bases that were soft clipped (None if not present). It is equal to seq[qstart:qend].
SAM/BAM files may include extra flanking bases that are not part of the alignment. These bases may be the result of the Smith-Waterman or other algorithms, which may not require alignments that begin at the first residue or end at the last. In addition, extra sequencing adapters, multiplex identifiers, and low-quality bases that were not considered for alignment may have been retained.
- query_alignment_start¶
start index of the aligned query portion of the sequence (0-based, inclusive).
This the index of the first base in seq that is not soft-clipped.
- query_length¶
the length of the query/read.
This value corresponds to the length of the sequence supplied in the BAM/SAM file. The length of a query is 0 if there is no sequence in the BAM/SAM file. In those cases, the read length can be inferred from the CIGAR alignment, see pysam.AlignmentFile.infer_query_length.().
This property can be set by providing a sequence.
- query_name¶
the query template name (None if not present)
- query_qualities¶
read sequence base qualities, including soft clipped bases (None if not present).
Quality scores are returned as a python array of unsigned chars. Note that this is not the ASCII-encoded value typically seen in FASTQ or SAM formatted files. Thus, no offset of 33 needs to be subtracted.
Note that to set quality scores the sequence has to be set beforehand as this will determine the expected length of the quality score array.
This method raises a ValueError if the length of the quality scores and the sequence are not the same.
- query_sequence¶
read sequence bases, including soft clipped bases (None if not present).
Note that assigning to seq will invalidate any quality scores. Thus, to in-place edit the sequence and quality scores, copies of the quality scores need to be taken. Consider trimming for example:
q = read.qual read.seq = read.seq[5:10] read.qual = q[5:10]
The sequence is returned as it is stored in the BAM file. Some mappers might have stored a reverse complement of the original read sequence.
- reference_end¶
aligned reference position of the read on the reference genome.
aend points to one past the last aligned residue. Returns None if not available.
- reference_id¶
reference ID
Note
This field contains the index of the reference sequence in the sequence dictionary. To obtain the name of the reference sequence, use pysam.AlignmentFile.getrname()
- reference_length¶
aligned length of the read on the reference genome.
This is equal to aend - pos. Returns None if not available.
- reference_start¶
0-based leftmost coordinate
- setTag(self, tag, value, value_type=None, replace=True)¶
Set optional field of alignment tag to value. value_type may be specified, but if not the type will be inferred based on the Python type of value
An existing value of the same tag will be overwritten unless replace is set to False.
the tags in the AUX field.
This property permits convenience access to the tags. Changes it the returned list will not update the tags automatically. Instead, the following is required for adding a new tag:
read.tags = read.tags + [("RG",0)]
This method will happily write the same tag multiple times.
- template_length¶
the observed query template length
master (AlignmentFile)¶
Any of the following functions or variables may be referenced when dealing with the master
- class pysam.AlignmentFile¶
- *(filename, mode=None, template = None,
- referencenames=None, referencelengths = None, text=NULL, header=None, add_sq_text=False, check_header=True, check_sq=True)*
A SAM/BAM formatted file. The file is automatically opened.
mode should be r for reading or w for writing. The default is text mode (SAM). For binary (BAM) I/O you should append b for compressed or u for uncompressed BAM output. Use h to output header information in text (TAM) mode.
If b is present, it must immediately follow r or w. Valid modes are r, w, wh, rb, wb, wbu and wb0. For instance, to open a BAM formatted file for reading, type:
f = pysam.AlignmentFile('ex1.bam','rb')
If mode is not specified, we will try to auto-detect in the order ‘rb’, ‘r’, thus both the following should work:
f1 = pysam.AlignmentFile('ex1.bam') f2 = pysam.AlignmentFile('ex1.sam')
If an index for a BAM file exists (.bai), it will be opened automatically. Without an index random access to reads via fetch() and pileup() is disabled.
For writing, the header of a SAM file/BAM file can be constituted from several sources (see also the samtools format specification):
- If template is given, the header is copied from a another AlignmentFile (template must be of type AlignmentFile).
- If header is given, the header is built from a multi-level dictionary. The first level are the four types (‘HD’, ‘SQ’, ...). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.
- If text is given, new header text is copied from raw text.
- The names (referencenames) and lengths (referencelengths) are supplied directly as lists. By default, ‘SQ’ and ‘LN’ tags will be added to the header text. This option can be changed by unsetting the flag add_sq_text.
By default, if a file is opened in mode ‘r’, it is checked for a valid header (check_header = True) and a definition of chromosome names (check_sq = True).
- __enter__(self)¶
- __exit__(self, exc_type, exc_value, traceback)¶
- __iter__¶
x.__iter__() <==> iter(x)
- __next__()¶
python version of next().
- close(self)¶
closes the pysam.AlignmentFile.
- count(self, reference=None, start=None, end=None, region=None, until_eof=False)¶
(reference = None, start = None, end = None, region = None, callback = None, until_eof = False)
count reads region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Note that a SAM file does not allow random access. If region or reference are given, an exception is raised.
- fetch(self, reference=None, start=None, end=None, region=None, tid=None, callback=None, until_eof=False, multiple_iterators=False)¶
fetch aligned reads in a region using 0-based indexing. The region is specified by reference, start and end. Alternatively, a samtools region string can be supplied.
Without reference or region all mapped reads will be fetched. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.
If until_eof is given, all reads from the current file position will be returned in order as they are within the file. Using this option will also fetch unmapped reads.
Set multiple_iterators to true if you will be using multiple iterators on the same file at the same time. The iterator returned will receive its own copy of a filehandle to the file effectively re-opening the file. Re-opening a file creates some overhead, so beware.
If only reference is set, all reads aligned to reference will be fetched.
Note that a SAM file does not allow random access. If region or reference are given, an exception is raised.
- filename¶
number of filename associated with this object.
- getrname(self, tid)¶
convert numerical tid into reference name.
- gettid(self, reference)¶
convert reference name into numerical tid
returns -1 if reference is not known.
- head(self, n, multiple_iterators=True)¶
return iterator over the first n alignments.
This is useful for inspecting the bam-file.
multiple_iterators is set to True by default in order to avoid changing the current file position.
- header¶
header information within the sam file. The records and fields are returned as a two-level dictionary.
The first level contains the record (HD, SQ, etc) and the second level contains the fields (VN, LN, etc).
The parser is validating and will raise an AssertionError if if encounters any record or field tags that are not part of the SAM specification. Use the pysam.AlignmentFile.text attribute to get the unparsed header.
The parsing follows the SAM format specification with the exception of the CL field. This option will consume the rest of a header line irrespective of any additional fields. This behaviour has been added to accommodate command line options that contain characters that are not valid field separators.
- lengths¶
tuple of the lengths of the reference sequences. The lengths are in the same order as pysam.AlignmentFile.references
- mapped¶
total number of mapped alignments in file.
- mate(self, AlignedSegment read)¶
return the mate of AlignedSegment read.
Throws a ValueError if read is unpaired or the mate is unmapped.
Note
Calling this method will change the file position. This might interfere with any iterators that have not re-opened the file.
Note
This method is too slow for high-throughput processing. If a read needs to be processed with its mate, work from a read name sorted file or, better, cache reads.
- next¶
x.next() -> the next value, or raise StopIteration
- nocoordinate¶
total number of reads without coordinates
- nreferences¶
number of reference sequences in the file.
- pileup(self, reference=None, start=None, end=None, region=None, **kwargs)¶
perform a pileup within a region. The region is specified by reference, start and end (using 0-based indexing). Alternatively, a samtools region string can be supplied.
Without reference or region all reads will be used for the pileup. The reads will be returned ordered by reference sequence, which will not necessarily be the order within the file.
The method returns an iterator of type pysam.IteratorColumn unless a callback is provided. If a *callback is given, the callback will be executed for each column within the region.
Note that SAM formatted files do not allow random access. In these files, if a region or reference are given an exception is raised.
Optional kwargs to the iterator:
- stepper
The stepper controlls how the iterator advances. Possible options for the stepper are
- all
- use all reads for pileup.
- pass
- skip reads in which any of the following flags are set: BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL, BAM_FDUP
- samtools
- same filter and read processing as in csamtools pileup. This requires a fastafile to be given.
- fastafile
- A FastaFile object. This is required for some of the steppers.
- mask
- Skip all reads with bits set in mask if mask=True.
- max_depth
- Maximum read depth permitted. The default limit is 8000.
truncate
By default, the samtools pileup engine outputs all reads overlapping a region (see note below). If truncate is True and a region is given, only output columns in the exact region specificied.Note
all reads which overlap the region are returned. The first base returned will be the first base of the first read not necessarily the first base of the region used in the query.
- references¶
tuple with the names of reference sequences.
- reset(self)¶
reset file position to beginning of file just after the header.
- seek(self, uint64_t offset, int where=0)¶
move file pointer to position offset, see pysam.AlignmentFile.tell().
- tell(self)¶
return current file position.
- text¶
full contents of the sam file header as a string
See pysam.AlignmentFile.header to get a parsed representation of the header.
- unmapped¶
total number of unmapped reads in file.
- write(self, AlignedSegment read) → int¶
write a single pysam.AlignedSegment to disk.
returns the number of bytes written.
The subset interface¶
What is the subset interface for?¶
The subset operation allows the user to create BAMs containing a specific subset of reads from within a given BAM file. For example, a subset BAM containing all of the unmapped reads in the file, or all the reads which contain a certain sequence.
In this chapter we detail the process you must follow in order to run the subset operation.
Producing a single subset¶
With parabam it is easy to create BAM files containing a subset of reads from a larger BAM file. In this section we will explore how to create a single subset with parabam. With single subsets there is only one criterion by which reads are stored. This makes for simple code, but if you want to create multiple subsets you should read the section detailing multiple subsets.
From the command line¶
Here is a run-though illustrating how we would create a parabam run that collects unmapped reads from a file.
In order to use the subset interface from the command line we first navigate to a directory where we will conduct our analysis.
cd /analysis/parabam
We then make a new file which will contain the code which parabam will use to analyse the BAM file. Let’s make our instructions file:
touch get_unmapped_reads.py
Now we edit the file we just created - get_unmapped_reads.py - to include the following code.
def rule(read,constants,master):
if read.is_unmapped:
return True
return False
By returning True we tell parabam to store the read in its subset.
We would then invoke this instruction from the command line as follows:
parabam subset -p16 -c50000 -i get_unmapped_reads -b /path/to/bam/file_name.bam -o unmapped_reads
This will create a file called unmapped_reads.bam with all the unmapped reads from our BAM file.
Notice that we include the arguments -p and -c. These options tell parabam how many processors to use and how many reads to hold in memory at any one time. The more processes you allow parabam to run, the quicker it can run on all the reads in a file. However, you should not specify more processes than your computer has processors. This will cause your computer to slow down. Additionally, the more reads that you allow parabam to store in memory, the faster it will run. But beware. Specifing too many reads will make your computer run out of memory and become unresponsive.
If you need to know what any of the arguments to the subset operation are just type the following to view the help text:
parabam subset -h
Producing multiple subsets¶
Sometimes we may wish to store multiple subsets from the same BAM file. We can do this both on the command line or by invoking parabam programatically via Python.
From the command line¶
This example will create two subsets. One subset of reads which are deemd secondary aligments and one subset which failed QC. We will make use of calls to variables defined in pysam.AlignedSegment. The full documentation for this class is listed in Fundamentals.
First, we navigate to a directory where we can carry out the analysis and create an instructions file:
cd /analysis/parabam
touch multiple_subsets.py
We then need to write the code that allows us to create the subsets. It’s more involved than the single subset example above. We should edit the file multiple_subsets.py to contain the following code:
def get_subset_types():
return ["failed_qc","secondary_aligments"]
def rule(read,constants,master):
belongs = []
if read.is_qcfail:
belongs.append("failed_qc")
elif read.is_secondary:
belongs.append("secondary_aligment")
return belongs
The above code block sees us introduce the function get_subset_types. Notice how first we define the subset groups as a list of strings. These strings can be anything you want, but it make sense to label them appropriately.
Then, in the rule, we return a list of subsets to which the read belongs. For example, if a read returns True for both is_qcfail and is_secondary it will be added to both files.
We can then run parabam from the command line like so, imagining that we have a file human_sample_001.bam:
parabam subset -v2 -i multiple_subsets -b /path/to/bam/human_sample_001.bam
This will produce two files. One file called human_sample_001_failed_qc.bam and the other called human_sample_001_secondary_aligments.bam. Notice that parabam automatically handles the naming of these files.
Programatically¶
It is possible to integrate parabam subset into your program by making use of the interface class.
Simply create a new object of type parabam.subset.Interface and call the run method. The specification follows.
Subset Specification¶
Below is the parabam subset specification
Examples subset¶
Here is some code detailing useages of the subset operation