BT-FLOR: Build Transcripts From LOng Reads¶
The software build transcripts from long reads.
The software get bam file as input and output gtf file with all alternative spliced transcripts and exons that declared by average of mapping positions of close reads.
The gtf also contain the counts of read in each transcript.
Optinally user can supply known gtf file. In this case the gtf also contains the percent of overlap between the known gtf to the closest transcript and exon that found by the BT-FLOR.
Installation¶
Support: refael.kohen@weizmann.ac.il, refael.kohen@gmail.com
Requirements¶
python 2.7
python package: pysam==0.15.2
- installation insruction for pysam you can find here:
-
Example of installation of pysam with conda:
conda create -n btflor python=2.7 conda activate btflor conda config --add channels r conda config --add channels bioconda conda install pysam==0.15.2
Installation BT-FLOR package¶
pip install bbcu.bt-flor
User guide¶
Run pre-process the input bam file¶
This step filter out unwanted reads.
In stranded protocol it is also split the input bam file to 2 output bam files one for each strand.
usage: bt-flor-separate-bam-strands.py [-h] –input-file INPUT_FILE
Pre-process the input bam file
- optional arguments:
-h, --help show this help message and exit --input-file INPUT_FILE Full path to input .bam or .sam file (default: None) --min-mapq MIN_MAPQ Minimum quality of the read mapping (default: 10) --max-gene-length MAX_GENE_LENGTH Maximum length of the gene. Reads that will be mapped to longer bases will be discarded (default: 100000) --filter-cigar FILTER_CIGAR Filter out reads with these characters in the cigar. For example: DSHI - filter reads with deletion/insertion/softclipped/hardclipped (default: ‘’) --filter-tags TAG_NAME,VALUE TAG_NAME,VALUE … [TAG_NAME,VALUE TAG_NAME,VALUE … …]
Filter out the reads that contain the substring in the value of the tag. For example: –filter-tags XF,__ filter out reads with “__” in value of XF tag (htseq-count indicate that the read mapped to non-genomic region) (default: None)
--is-stranded-protocol Is stranded protocol (default: False) --log-file LOG_FILE Log File (default: None)
Run the main program¶
This step is the main program.
In stranded protocol you need run the command twice: on each of the two bam files that are created in the previous step - plus and minus strand
usage: bt-flor.py [-h] –input-file INPUT_FILE –gtf-output-file
Acurate assembly of transcripts according mapped reads
- optional arguments:
-h, --help show this help message and exit --input-file INPUT_FILE Full path to input .bam or .sam file (default: None) --gtf-output-file GTF_OUTPUT_FILE Full path to output file name (default: None) --is-stranded-protocol Is stranded protocol (default: False) --max-dist-internal-edge-from-average MAX_DIST_INTERNAL_EDGE_FROM_AVERAGE Maximum distance between reads in start and end of the internal exons of the trancript (except the start of the first exon and end of the last exon) (default: 3) --max-dist-external-edge-from-average MAX_DIST_EXTERNAL_EDGE_FROM_AVERAGE For non-stranded protocol: maximum distance between reads in start of the first exon and the end of the last exon of the trancript (default: 3) --max-dist-first-edge-from-average MAX_DIST_FIRST_EDGE_FROM_AVERAGE For stranded protocol: maximum distance between reads in start of the first exon of the trancript (default: 3) --max-dist-last-edge-from-average MAX_DIST_LAST_EDGE_FROM_AVERAGE For stranded protocol: maximum distance between reads in end of the last exon in the trancript (sometimes the enzyme drops before the end of the transcript) (default: 50) --local-average-max-num-positions LOCAL_AVERAGE_MAX_NUM_POSITIONS Maximum neighbors positions for which the average will be calculated and the distance from this average will be considered) (default: 5) --known-sorted-gtf-file KNOWN_SORTED_GTF_FILE Full path the known gtf file sorted by chromosome and then by start position in ascending order (you can use the command: bedtools sort -i <gtf-file>). The program will create *.gz (compressed) and *.gz.tbi (index) files in the same location of the known-gtf-file if they don’t exists. (default: None) --threads THREADS number of threads. Each chromosome can run in parallel. (default: 1) --log-file LOG_FILE Log File (default: None)
Releases¶
- 1.0.0 - First version
Source code¶
The algorithm¶
Input file¶
The input is bam file with mapped reads sorted by the start mapping point.
Step 1 - grouping read according start mapping position and finding their average position¶
The algorithm go over the reads and check if the next read is “close” (parameter) to the average of the start mapping points of the previous reads in the “last positions” (parameter).
- If the read is “close”, it is attached to the group of the previous reads and the algorithm recalculate the average in “last positions” with the mapping position the current read and continues to check the next read.
- Else, the read don’t belong to the group of the previous reads, and the group is closed and is sent to the next step.
Parameters:
–local-average-num-positions (“last positions”): number of positions before the position of the current read for them we calculate the average of the mapped reads.
–max-dist-external-edge-from-average (“close”): the maximum distance in bases of the current read from the average of the previous reads (in “last positions”), for which the read called “close” to the group of the previous reads.
Step 2 - finding average of each start and end edges of the exons in the group from step 1¶
For each group from step 1:
Go over the start and end of the exons of the reads in the group from step 1, and for each start of an exon i (called “internal edge i”) do:
- Sort ascendingly the reads according the mapping point of “internal edge i”.
- Go over the “internal edge i” of the reads and check if it is “close” (parameter) to the average of the “internal edges i” of the previous reads in the “last positions” (parameter).
If “internal edge i” is “close”, it is attached to the sub-group of “internal edges i” and the algorithm recalculate the average of “internal edges i” in “last positions” with the mapping position the current exon and continues to check the “internal edge i” of the next read
Else, the “internal edge i” of the read don’t belong to the sub-group of the “internal edge i” and became to be “internal edge i+1” (and thus the enumerate of the all next start points of the exons of this read are uploaded by 1). In the next iteration of the loop the current “internal edge i” will be recalculate with the i+1 sub-group.
Parameters:
–local-average-num-positions (“last positions”): - the same parameter like in the step 1. number of positions before the position of the current read for them we calculate the average of the mapped reads.
–max-dist-internal-edge-from-average (“close”): the maximum distance in bases of the current read from the average of the previous reads (in “last positions”), for which the read called “close” to the group of the previous reads.
- Return on step 2 also for the end points of the exons.
- In the end edge of the last exon we defined “close” with the parameter from step 1: –max-dist-external-edge-from-average
Stranded protocol¶
In non-stranded protocol we cannot know which of the external edges of the read is the start of the transcript and which is the end. Therefore we use only in 2 parameters for declaring edges “external” - for the two edges of the read, and “internal” for the internal edges of the exons:
–max-dist-internal-edge-from-average and
–max-dist-external-edge-from-average.
In stranded protocol we know what is the start and the end sides of the transcript, so we use with other 2 parameters instead the parameter –max-dist-internal-edge-from-average:
–max-dist-first-edge-from-average and
–max-dist-end-edge-from-average.
Generally, the end edge is more noisy, so the user can enable more flexibility.
In stranded protocol the algorithm get separate bam file for each strand and run on them separately. The package contains a script for the separating the bam to the differenct strands.
Step 3 - build transcripts¶
For each group from step 2, go over on each read in the group, and combine the reads with the same average of edges - internal and external. Each such group called transcript.
Output¶
Gtf file - contains the transcripts and its exons. Each transcript has the counts of reads in this transcript.
Step 4 (optional) - intersection with known gtf¶
If the user supply input of gtf with known genes, the output will contain also the intersection of each transcript and exon with the nearest known gene and the percent of the overlapping.
License¶
BT-FLOR is licensed under GNU General Public License version 3. License needed for commercial use.
Author¶
Refael Kohen,
refael.kohen@weizmann.ac.il, refael.kohen@gmail.com
Bioinformatics unit at Life Sciences Core Facilities (LSCF)
Weizmann Institute of Science, Rehovot 76100, Israel.