Documentation

Contents:

Repeat masking

Contents:

BuildDatabase

Wrapper around makeblastdb to format blast databases for use with RepeatModeler.

Program parameters

-name <database name>
    The name of the database to create.

-engine <engine name>
    The name of the search engine we are using. I.e abblast/wublast or
    ncbi (rmblast version).

-dir <directory>
    The name of a directory containing fasta files to be processed. The
    files are recognized by their suffix. Only *.fa and *.fasta files
    are processed.

-batch <file>
    The name of a file which contains the names of fasta files to
    process. The files names are listed one per line and should be fully
    qualified.

Typical usage

/exports/software/repeatmodeler/RepeatModeler/BuildDatabase -engine ncbi -name mydb seq.fa

Gridengine wrapper

See RepeatModeler.

RepeatModeler

RepeatModeler (link)

version 1.08

de novo repeat family identification and modeling using RECON, RepeatScout, TRF & RMBlast [*] on blast databases formatted with BuildDatabase.

Program parameters

-database
    The prefix name of a XDF formatted sequence database containing the
    genomic sequence to use when building repeat models. The database
    may be created with the WUBlast "xdformat" utility or with the
    RepeatModeler wrapper script "BuildXDFDatabase".

-engine <abblast|wublast|ncbi>
    The name of the search engine we are using. I.e abblast/wublast or
    ncbi (rmblast version).

-pa #
    Specify the number of shared-memory processors available to this
    program. RepeatModeler will use the processors to run BLAST searches
    in parallel. i.e on a machine with 10 cores one might use 1 core for
    the script and 9 cores for the BLAST searches by running with "-pa
    9".

-recoverDir <Previous Output Directory>
    If a run fails in the middle of processing, it may be possible
    recover some results and continue where the previous run left off.
    Simply supply the output directory where the results of the failed
    run were saved and the program will attempt to recover and continue
    the run.

Typical usage

/exports/software/repeatmodeler/RepeatModeler/RepeatModeler -database mydb -engine ncbi -pa 32

Gridengine wrapper

RepeatModeler will use more threads than defined by -pa so use nice

repeatmodeler.sh:

#!/bin/bash

#$ -V
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log

# define $DATADIR and $DATABASE before/when calling this script, e.g.:
#       export DATADIR=`pwd`
#       qsub -v SEQFILE=seq.fa -v DATABASE=mydb ...

WORKDIR=/scratch/$USER/repeatmodeler/$JOB_ID
mkdir -p $WORKDIR
cp $DATADIR/$SEQFILE $WORKDIR
cd $WORKDIR
/exports/software/repeatmodeler/RepeatModeler/BuildDatabase -engine ncbi -name $DATABASE $SEQFILE
nice -n 10 /exports/software/repeatmodeler/RepeatModeler/RepeatModeler -database $DATABASE -engine ncbi -pa $NSLOTS
rsync -a ./consensi.fa.classified $DATADIR/$SEQFILE.consensi.fa.classified

qsub command

export DATADIR=`pwd`
qsub -v SEQFILE=seq.fa -v DATABASE=mydb -pe smp 32 /path/to/repeatmodeler.sh

parallel qsub command

Submit a job for each .fa file in a directory, limit to 2 simultaneous jobs to avoid flooding queue.

export DATADIR=`pwd`
parallel -j 2 qsub -v SEQFILE={} -v DATABASE={.} -pe smp 32 -sync y /path/to/repeatmodeler.sh ::: *.fa

Footnotes

[*]Alternatively ABBlast/WUBlast may be used.

Filter RepeatModeler Library

Repeat libraries built with RepeatModeler may contain non-TE protein coding sequences. If masking an annotated genome, filter the library using RepeatMasker RepeatPeps.lib and a proteome to remove ‘genuine’ protein hits.

Filtering steps

Blast proteome against RepeatMasker TE database
blastp -query proteins.fa \
       -db /exports/software/repeatmasker/repeatmasker-4-0-5/Libraries/RepeatPeps.lib \
       -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
       -max_target_seqs 25 \
       -culling_limit 2 \
       -num_threads 48 \
       -evalue 1e-5 \
       -out proteins.fa.vs.RepeatPeps.25cul2.1e5.blastp.out
Remove TEs from proteome
fastaqual_select -f transcripts.fa \
       -e <(awk '{print $1}' proteins.fa.vs.RepeatPeps.25cul2.1e5.blastp.out | sort | uniq) > transcripts.no_tes.fa
Blast proteome against RepeatModeler library
makeblastdb -in transcripts.no_tes.fa -dbtype nucl
blastn -task megablast \
       -query consensi.fa.classified \
       -db transcripts.no_tes.fa \
       -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
       -max_target_seqs 25 \
       -culling_limit 2 \
       -num_threads 48 \
       -evalue 1e-10 \
       -out repeatmodeller_lib.vs.transcripts.no_tes.25cul2.1e10.megablast.out
Remove hits from RepeatModeler library
fastaqual_select -f consensi.fa.classified \
       -e <(awk '{print $1}' ../../repeatmodeller_lib.vs.transcripts.no_tes.25cul2.1e25.megablast.out | sort | uniq) > consensi.fa.classified.filtered_for_CDS_repeats.fa

Grid engine wrapper

filter_repmodlib.sh:

#!/bin/bash

#$ -V
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
. /etc/profile.d/modules.sh
module load blast

# export DATADIR=/path/to/
# qsub -v PROTSEQFILE=proteins.fa
#      -v BLASTPDB=/path/to/RepeatPeps.lib
#      -v TSCSEQFILE=transcripts.fa
#      -v REPMODLIB=consensi.fa.classified

WORKDIR=/scratch/$USER/blastp/$JOB_ID
mkdir -p $WORKDIR
cd $WORKDIR

# Blast proteome against RepeatMasker TE database
blastp -query $DATADIR/$PROTSEQFILE -db $BLASTPDB -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
       -max_target_seqs 25 -culling_limit 2 -num_threads $NSLOTS -evalue 1e-5 \
       -out $PROTSEQFILE.vs.RepeatPeps.25cul2.1e5.blastp.out

# Remove TEs from proteome
/exports/colossus/lepbase/repeatmasking/fastaqual_select -f $DATADIR/$TSCSEQFILE \
       -e <(awk '{print $1}' $PROTSEQFILE.vs.RepeatPeps.25cul2.1e5.blastp.out | sort | uniq) > $TSCSEQFILE.no_tes.fa

# Blast proteome against RepeatModeler library
makeblastdb -in $TSCSEQFILE.no_tes.fa -dbtype nucl
blastn -task megablast \
       -query $DATADIR/$REPMODLIB \
       -db $TSCSEQFILE.no_tes.fa \
       -outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
       -max_target_seqs 25 \
       -culling_limit 2 \
       -num_threads $NSLOTS \
       -evalue 1e-10 \
       -out $REPMODLIB.vs.$TSCSEQFILE.25cul2.1e10.megablast.out

# Remove hits from RepeatModeler library
/exports/colossus/lepbase/repeatmasking/fastaqual_select -f $DATADIR/$REPMODLIB \
       -e <(awk '{print $1}' $REPMODLIB.vs.$TSCSEQFILE.25cul2.1e10.megablast.out | sort | uniq) > $REPMODLIB.filtered_for_CDS_repeats.fa

rsync -a --remove-source-files ./$REPMODLIB.filtered_for_CDS_repeats.fa $DATADIR
qsub command
export DATADIR=`pwd`
qsub -pe smp 16 -v PROTSEQFILE=proteins.fa -v BLASTPDB=/path/to/RepeatPeps.lib -v TSCSEQFILE=transcripts.fa -v REPMODLIB=consensi.fa.classified /path/to/filter_repmodlib.sh

Combine Repeat Libraries

Use the RepeatMasker tool queryRepeatDatabase.pl to extract taxon-specific repeats:

/exports/software/repeatmasker/repeatmasker-4-0-5/util/queryRepeatDatabase.pl \
       -species "taxon" > repeatmasker.taxon.fa

and combine with a repeatmodeller library. The output of queryRepeatDatabase.pl has an additional line at the begining before the start of the fasta format sequences so use tail -n+2 to skip the first line:

tail -n+2 repeatmasker.taxon.fa | cat - taxon.consensi.fa.classified > taxon.repeatlib.fa

RepeatMasker

Typical Usage

/exports/software/repeatmasker/repeatmasker-4-0-5/RepeatMasker \
       -pa 48 \
       -lib taxon.repeatlib.fa \
       -dir . \
       -xsmall \
       seq.fa

Gridengine wrapper

repeatmasker.sh:

#!/bin/bash

#$ -V
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log

#       export DATADIR=`pwd`
#              -v SEQFILE=seq.fa
#              -v REPLIB=repeatLib.fa ...

WORKDIR=/scratch/$USER/repeatmasker/$JOB_ID
mkdir -p $WORKDIR
cd $WORKDIR
/exports/software/repeatmasker/repeatmasker-4-0-5/RepeatMasker \
       -pa $NSLOTS \
       -lib $DATADIR/$REPLIB \
       -dir . \
       -xsmall \
       $DATADIR/$SEQFILE
rsync -a ./$SEQFILE.* $DATADIR/

qsub command

export DATADIR=`pwd`
qsub -v SEQFILE=seq.fa -v REPLIB=repeatLib.fa -pe smp 32 /path/to/repeatmasker.sh

useful one-liners

summarise genome size and repeat content across multiple results files in a
folder (pattern matching specific to lepbase naming conventions):
perl -0777 -ne '$ARGV=~s/_-.+//;m/total.+?(\d+).+bases.+?(\d+)/s;print "$ARGV\t$1\t",($2/$1*100),"\n"' *_-_scaffolds.fa.tbl

TRF

Tandem Repeats Finder [1] (link)

version 4.07b

Program parameters

$ /exports/software/trf/trf-4.07b/trf

Tandem Repeats Finder, Version 4.07b
Copyright (C) Dr. Gary Benson 1999-2012. All rights reserved.


Please use: /exports/software/trf/trf-4.07b/trf File Match Mismatch Delta PM PI Minscore MaxPeriod [options]

Where: (all weights, penalties, and scores are positive)
  File = sequences input file
  Match  = matching weight
  Mismatch  = mismatching penalty
  Delta = indel penalty
  PM = match probability (whole number)
  PI = indel probability (whole number)
  Minscore = minimum alignment score to report
  MaxPeriod = maximum period size to report
  [options] = one or more of the following
               -m    masked sequence file
               -f    flanking sequence
               -d    data file
               -h    suppress html output
               -r    no redundancy elimination
               -ngs  more compact .dat output on multisequence files, returns 0 on success. You may pipe input in with this option using - for file name. Short 50 flanks are appended to .dat output. See more information on TRF Unix Help web page.

Recommended usage (plus html output suppression)

trf yoursequence.txt 2 7 7 80 10 50 500 -f -d -m -h

Program outputs

Outputs are written to input directory, using input filename suffixed by parameter values and file type

infile.2.7.7.80.10.50.500.mask

Hard-masked fasta format sequence file (option -m):

>seq1
GTTGTTTTTCTGAGACCGAAAAGGCTTGTTTTTATGTAGCAAGTTCAAGAACTGGTTTCG
GTATCGCAGGGTATGTATGGTCTGAGACGTATAACTGTCACGANNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNTATGATGAATGGGTAATTTTTAAAGTATCTCACATAACTTTTAGATG
TAGTAAGACCATGTTTGCGAATCGAATCTCTTACTGGTTGAACGCCAAAGGACCTTCCAT
...
>seq2
...
infile.2.7.7.80.10.50.500.dat

Text file with space delimited blocks (option -d):

Tandem Repeats Finder Program written by:

Gary Benson
Program in Bioinformatics
Boston University
Version 4.07b


Sequence: seq1



Parameters: 2 7 7 80 10 50 500


3404 3433 14 2.1 14 93 0 51 30 16 30 23 1.96 AGCAGCACTTGAGT AGCAGTACTTGAGTAGCAGCACTTGAGTAG
6344 6383 18 2.2 18 95 0 71 45 2 10 42 1.51 ATTATTATAAGGAATTAC ATTATTATAAGGAATTATATTATTATAAGGAATTACATTA
...
6237875 6237915 16 2.6 16 96 3 75 0 4 48 46 1.23 TGTGTGTGTGTGCGTG TGTGTGTGTGTGCGTGTGTGTGTGTGTGCGTGTGTTGTGTG


Sequence: seq2



Parameters: 2 7 7 80 10 50 500


15030 15056 14 1.9 14 100 0 54 66 0 0 33 0.92 TTAAATAAATAAAT TTAAATAAATAAATTTAAATAAATAAA
...

Each block contains:

cols 1+2:  Indices of the repeat relative to the start of the sequence
col 3:     Period size of the repeat
col 4:     Number of copies aligned with the consensus pattern
col 5:     Size of consensus pattern (may differ slightly from the period size)
col 6:     Percent of matches between adjacent copies overall
col 7:     Percent of indels between adjacent copies overall
col 8:     Alignment score
cols 9-12: Percent composition for each of the four nucleotides
col 13:    Entropy measure based on percent composition
col 14:    Consensus sequence
col 15:    Repeat sequence

Grid engine wrapper

trf.sh:

#!/bin/bash

#$ -V
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log

# define $DATADIR and $SEQFILE before/when calling this script, e.g.:
#       export DATADIR=`pwd`
#       qsub -v SEQFILE=seq.fa ...

WORKDIR=/scratch/$USER/trf/$JOB_ID
mkdir -p $WORKDIR
cp $DATADIR/$SEQFILE $WORKDIR
cd $WORKDIR
nice -n 10 /exports/software/trf/trf-4.07b/trf $SEQFILE 2 7 7 80 10 50 500 -f -d -m -h
rsync -a --remove-source-files ./${SEQFILE}* $DATADIR

qsub command

export DATADIR=`pwd`
qsub -v SEQFILE=seq.fa /path/to/trf.sh

parallel command

Split large sequence file into blocks with pyfasta, process each block using parallel to submit to gridengine with one slot per job, then combine outputs into single .mask and .dat files.

pip install pyfasta
pip install numpy
export DATADIR=`pwd`
export INFILE=seq # OMIT EXTENSION!
export SCRIPTDIR=~
pyfasta split -n 32 $INFILE
parallel -j 32 qsub -q nice.q -v SEQFILE={} -sync y $SCRIPTDIR/trf.sh ::: $INFILE.??.fa
cat $INFILE.??.fa.*.mask > $INFILE.fa.mask
cat $INFILE.??.fa.*.dat > $INFILE.fa.dat
rm *.log
rm $INFILE.??.fa.*.mask
rm $INFILE.??.fa.*.dat
rm $INFILE.??.fa

References

[1]Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences Nucleic Acids Research 27, 573-580.

Virtualisation

Contents:

Install Ubuntu VM

Initial KVM setup if this is the first VM:

virsh pool-define-as --name kvmpool --type dir --target /var/kvm/images
virsh pool-start kvmpool

Install a Ubuntu 14.04 VM direct from archive.ubuntu.com:

virt-install \
--name example \
--ram 4096 \
--disk path=/var/kvm/images/example.qcow2,size=24,format=qcow2 \
--vcpus=2,maxvcpus=6 \
--cpu host \
--os-type linux \
--os-variant ubuntutrusty \
--graphics none \
--console pty,target_type=serial \
--location 'http://uk.archive.ubuntu.com/ubuntu/dists/trusty/main/installer-amd64/' \
--extra-args 'console=ttyS0,115200n8 serial'

Whole genome alignment

Contents:

Progressive Cactus

Progressive Cactus (link)

Typical Usage

/exports/software/progressiveCactus/bin/runProgressiveCactus.sh \
    /path/to/align.txt \
    /path/to/data/directory \
    /path/to/output/align.hal \
    --database kyoto_tycoon \
    --maxThreads 64

Input format

align.txt:

(tax1,(tax2,tax3));
tax1 /path/to/tax1.fa
tax2* /path/to/tax2.fa
tax3 /path/to/tax3.fa

Contains a newick tree on line 1 (don’t forget the semicolon) followed by a set of mappings from taxon tames to (soft-masked) fasta files.

A taxon name with and asterisk is marked as reference quality (if no asterisks, all sequences will be treated as reference quality)

Gridengine wrapper

progressivecactus.sh (Progressive Cactus alignment followed by conversion to an Assembly Hub and a MAF file):

#!/bin/bash

#$ -V
#$ -cwd
#$ -j y
#$ -o $JOB_ID.log
#$ -pe smp 64

SCRATCH=/scratch/$USER/wga/$JOB_ID
mkdir -p $SCRATCH/$DIR
cd $SCRATCH

export PYTHONPATH=$PYTHONPATH:/exports/software/progressiveCactus/submodules/biopython
. /exports/software/progressiveCactus/environment

nice -n 10 /exports/software/progressiveCactus/bin/runProgressiveCactus.sh $DATA/$INFILE ./$DIR ./$DIR/$HALFILE --database kyoto_tycoon --maxThreads $NSLOTS

if ! [ -z $HUBFILE ]; then
nice -n 10 /exports/software/progressiveCactus/submodules/hal/bin/hal2assemblyHub.py --gcContent --maxThreads=$NSLOTS --defaultMemory=10000000000 ./$DIR/$HALFILE ./$DIR/$HUBFILE
fi

if ! [ -z $MAFFILE ]; then
nice -n 10 /exports/software/progressiveCactus/submodules/hal/bin/hal2mafMP.py ./$DIR/$HALFILE $MAFFILE --numProc $NSLOTS --refGenome $REFGENOME
fi

rsync -av --remove-source-files ./$DIR $DATA/output

find ./$DIR -depth -type d -empty -delete

qsub command

qsub -v SEQFILE=seq.fa \
     -v DIR=input \
     -v DATA=/path/to/data \
     -v INFILE=align.txt \
     -v HALFILE=align.hal \
     -v HUBFILE=align.hub \
     -v MAFFILE=align.maf \
     -v REFGENOME=reference_genome_name \
     -pe smp 32 \
     /path/to/repeatmasker.sh

Indices and tables