BT-FLOR: Build Transcripts From LOng Reads

The software build transcripts from long reads.

The software get bam file as input and output gtf file with all alternative spliced transcripts and exons that declared by average of mapping positions of close reads.

The gtf also contain the counts of read in each transcript.

Optinally user can supply known gtf file. In this case the gtf also contains the percent of overlap between the known gtf to the closest transcript and exon that found by the BT-FLOR.

Installation

Support: refael.kohen@weizmann.ac.il, refael.kohen@gmail.com

Requirements

Installation BT-FLOR package

pip install bbcu.bt-flor

User guide

Run pre-process the input bam file

This step filter out unwanted reads.

In stranded protocol it is also split the input bam file to 2 output bam files one for each strand.

usage: bt-flor-separate-bam-strands.py [-h] –input-file INPUT_FILE

Pre-process the input bam file

optional arguments:
-h, --help show this help message and exit
--input-file INPUT_FILE
 Full path to input .bam or .sam file (default: None)
--min-mapq MIN_MAPQ
 Minimum quality of the read mapping (default: 10)
--max-gene-length MAX_GENE_LENGTH
 Maximum length of the gene. Reads that will be mapped to longer bases will be discarded (default: 100000)
--filter-cigar FILTER_CIGAR
 Filter out reads with these characters in the cigar. For example: DSHI - filter reads with deletion/insertion/softclipped/hardclipped (default: ‘’)
--filter-tags

TAG_NAME,VALUE TAG_NAME,VALUE … [TAG_NAME,VALUE TAG_NAME,VALUE … …]

Filter out the reads that contain the substring in the value of the tag. For example: –filter-tags XF,__ filter out reads with “__” in value of XF tag (htseq-count indicate that the read mapped to non-genomic region) (default: None)

--is-stranded-protocol
 Is stranded protocol (default: False)
--log-file LOG_FILE
 Log File (default: None)

Run the main program

This step is the main program.

In stranded protocol you need run the command twice: on each of the two bam files that are created in the previous step - plus and minus strand

usage: bt-flor.py [-h] –input-file INPUT_FILE –gtf-output-file

Acurate assembly of transcripts according mapped reads

optional arguments:
-h, --help show this help message and exit
--input-file INPUT_FILE
 Full path to input .bam or .sam file (default: None)
--gtf-output-file GTF_OUTPUT_FILE
 Full path to output file name (default: None)
--is-stranded-protocol
 Is stranded protocol (default: False)
--max-dist-internal-edge-from-average MAX_DIST_INTERNAL_EDGE_FROM_AVERAGE
 Maximum distance between reads in start and end of the internal exons of the trancript (except the start of the first exon and end of the last exon) (default: 3)
--max-dist-external-edge-from-average MAX_DIST_EXTERNAL_EDGE_FROM_AVERAGE
 For non-stranded protocol: maximum distance between reads in start of the first exon and the end of the last exon of the trancript (default: 3)
--max-dist-first-edge-from-average MAX_DIST_FIRST_EDGE_FROM_AVERAGE
 For stranded protocol: maximum distance between reads in start of the first exon of the trancript (default: 3)
--max-dist-last-edge-from-average MAX_DIST_LAST_EDGE_FROM_AVERAGE
 For stranded protocol: maximum distance between reads in end of the last exon in the trancript (sometimes the enzyme drops before the end of the transcript) (default: 50)
--local-average-max-num-positions LOCAL_AVERAGE_MAX_NUM_POSITIONS
 Maximum neighbors positions for which the average will be calculated and the distance from this average will be considered) (default: 5)
--known-sorted-gtf-file KNOWN_SORTED_GTF_FILE
 Full path the known gtf file sorted by chromosome and then by start position in ascending order (you can use the command: bedtools sort -i <gtf-file>). The program will create *.gz (compressed) and *.gz.tbi (index) files in the same location of the known-gtf-file if they don’t exists. (default: None)
--threads THREADS
 number of threads. Each chromosome can run in parallel. (default: 1)
--log-file LOG_FILE
 Log File (default: None)

Releases

  • 1.0.0 - First version

The algorithm

Input file

The input is bam file with mapped reads sorted by the start mapping point.

Step 1 - grouping read according start mapping position and finding their average position

The algorithm go over the reads and check if the next read is “close” (parameter) to the average of the start mapping points of the previous reads in the “last positions” (parameter).

  • If the read is “close”, it is attached to the group of the previous reads and the algorithm recalculate the average in “last positions” with the mapping position the current read and continues to check the next read.
  • Else, the read don’t belong to the group of the previous reads, and the group is closed and is sent to the next step.

Parameters:

–local-average-num-positions (“last positions”): number of positions before the position of the current read for them we calculate the average of the mapped reads.

–max-dist-external-edge-from-average (“close”): the maximum distance in bases of the current read from the average of the previous reads (in “last positions”), for which the read called “close” to the group of the previous reads.

Step 2 - finding average of each start and end edges of the exons in the group from step 1

For each group from step 1:

Go over the start and end of the exons of the reads in the group from step 1, and for each start of an exon i (called “internal edge i”) do:

  1. Sort ascendingly the reads according the mapping point of “internal edge i”.
  2. Go over the “internal edge i” of the reads and check if it is “close” (parameter) to the average of the “internal edges i” of the previous reads in the “last positions” (parameter).
  • If “internal edge i” is “close”, it is attached to the sub-group of “internal edges i” and the algorithm recalculate the average of “internal edges i” in “last positions” with the mapping position the current exon and continues to check the “internal edge i” of the next read

  • Else, the “internal edge i” of the read don’t belong to the sub-group of the “internal edge i” and became to be “internal edge i+1” (and thus the enumerate of the all next start points of the exons of this read are uploaded by 1). In the next iteration of the loop the current “internal edge i” will be recalculate with the i+1 sub-group.

    Parameters:

    –local-average-num-positions (“last positions”): - the same parameter like in the step 1. number of positions before the position of the current read for them we calculate the average of the mapped reads.

    –max-dist-internal-edge-from-average (“close”): the maximum distance in bases of the current read from the average of the previous reads (in “last positions”), for which the read called “close” to the group of the previous reads.

  1. Return on step 2 also for the end points of the exons.
  • In the end edge of the last exon we defined “close” with the parameter from step 1: –max-dist-external-edge-from-average

Stranded protocol

  • In non-stranded protocol we cannot know which of the external edges of the read is the start of the transcript and which is the end. Therefore we use only in 2 parameters for declaring edges “external” - for the two edges of the read, and “internal” for the internal edges of the exons:

    –max-dist-internal-edge-from-average and

    –max-dist-external-edge-from-average.

  • In stranded protocol we know what is the start and the end sides of the transcript, so we use with other 2 parameters instead the parameter –max-dist-internal-edge-from-average:

    –max-dist-first-edge-from-average and

    –max-dist-end-edge-from-average.

Generally, the end edge is more noisy, so the user can enable more flexibility.

In stranded protocol the algorithm get separate bam file for each strand and run on them separately. The package contains a script for the separating the bam to the differenct strands.

Step 3 - build transcripts

For each group from step 2, go over on each read in the group, and combine the reads with the same average of edges - internal and external. Each such group called transcript.

Output

Gtf file - contains the transcripts and its exons. Each transcript has the counts of reads in this transcript.

Step 4 (optional) - intersection with known gtf

If the user supply input of gtf with known genes, the output will contain also the intersection of each transcript and exon with the nearest known gene and the percent of the overlapping.

License

BT-FLOR is licensed under GNU General Public License version 3. License needed for commercial use.

Author

Refael Kohen,

refael.kohen@weizmann.ac.il, refael.kohen@gmail.com

Bioinformatics unit at Life Sciences Core Facilities (LSCF)

Weizmann Institute of Science, Rehovot 76100, Israel.