Bio Bits¶
Various bioinformatics scripts
All documentation is hosted at http://bio-bits.readthedocs.org/en/latest
TODO¶
- Include existing scripts
Contents:
Installation¶
It is recommended to install into a virtualenv. If you know what you are doing and don’t want to install into virtualenv, then you can skip right to step 3
Setup Virtualenv
It is assumed you have virtualenv already installed. If not see https://virtualenv.pypa.io/en/latest/installation.html
virtualenv env
Activate virtualenv
. env/bin/activate
Install dependencies
pip install -r requirements.txt
For python 2.6 you will need to also install some additional packages
pip install -r requirements-py26.txt
Install bio_bits
python setup.py install
Scripts¶
rename_fasta¶
Many times you find you have a fasta file where the identifiers are all wrong and you want to rename them all via some mapping file.
Take the example where you have the following fasta file(example.fasta):
>id1
ATGC
>id2
ATGC
>id3
ATGC
You want to rename each identifier(id1, id2, id3) based on a mapping you have. In a file called renamelist.csv you would have the following:
#From,To
id1,samplename1
id2,samplename2
id3,samplename3
Then to rename your fasta without replacing the original file you have two options:
Rename without replacing original file
rename_fasta renamelist.csv example.fasta > renamedfasta.fasta
Rename replacing original file’s contents
reanme_fasta renamelist.csv example.fasta --inplace
Rename Mapping File Syntax¶
The file you specify as the rename map file is a simple comma separated text file.
The following rules apply to the format:
- The first entry is the identifier to find in the supplied fasta file.
- The second entry is what to replace the found identifier with
- Any line beginning with a pound sign(#) will be ignored by the renamer
Missing identifiers that are in fasta but not rename file¶
In the case where your fasta file contains an identifier that is not in the rename map file you supply, an error will be displayed in the console telling you as such:
idwhatever is not in provided mapping
beast_checkpoint¶
beast_checkpoint is a fork of https://gist.github.com/trvrb/5277297 that has been rewritten in python and slightly improved as the ruby script seemed to have a few errors.
It accepts any previously run or terminated beast run and will generate an xml file that essentially starts from the last generated tree/log state.
Since beast is random in nature, there does not appear to be a way to restart the run exactly from the same state that it left off.
Example¶
We will use the benchmark2.xml file that comes with Beast 1.8 This file is located in:
BEASTv1.8.0/examples/Benchmarks/benchmark2.xml
First you need to fix the benchmark2.xml because each taxa has a trailing space and that is annoying
$> sed 's/ "/"/' benchmark2.xml > beast.xml
Now run beast for about half of the iterations and hit CTRL-C to kill it This benchmark is set to run 1,000,000 iterations so around 500,000 you can kill it. Notice we are using a predefined seed
$> seed=1234567890
$> mkdir run1
$> cp beast.xml run1/beast.xml
$> beast -seed $seed -beagle_SSE beast.xml
Now we will want to re-run beast from that last state. We can use beast_checkpoint to do so by supplying the original xml and the produced trees and log files. We will put the new xml into a new directory since the .trees and .log files would create an error or possibly be overwritten.
NOTE If your fileLog and treeFileLog do not have the same logEvery then when beast exits you may end up with more/less tree states than log states. For now you will have to manually edit the files and ensure that the last tree state matches the last log state.
Todo
Could be possible to get beast_checkpoint to check for that scenario and use the last tree state that matches the last log state
$> mkdir run2
$> beast_checkpoint beast.xml *.trees *.log > run2/beast.xml
Now you can simply just re-run beast on the new xml using the same seed
$> cd run2
$> beast -seed $seed -beagle_SSE beast.xml
Tracer¶
If you name your runs sequentially as we did in the example(aka, run1, run2,...) then you can easily load all log files into tracer via the command line as follows
tracer run*/*.log
LogCombiner¶
After you have run all your beast checkpointed xml files you will probably want to combine them with logcombiner which comes with beast
beast_wrapper¶
Beast wrapper is intended as a helper script to run beast. At this point it just runs beast with the same arguments you would normally give to beast from the command line and just adds a estimated time left column to the console output
Example¶
$> beast_wrapper -beagle_SSE my_beast.xml
...
state Posterior Prior Likelihood rootHeight my_beast.ucld.mean location.clock.rate location.nonZeroRates
0 -86527.5880 -6850.8316 -79676.7564 57.6772 1.16103E-3 4.86012 15.0000 -
20000 -29044.3753 -1123.5287 -27920.8466 288.102 3.02471E-4 0.11891 16.0000 0.21 hours/million states 2d 04:29:44
40000 -25517.9525 -979.5343 -24538.4182 211.705 1.35118E-4 0.25060 16.0000 0.25 hours/million states 2d 14:29:24
60000 -24212.1250 -1040.4103 -23171.7147 188.454 1.05572E-4 0.18908 15.0000 0.25 hours/million states 2d 14:29:06
80000 -24097.9354 -1019.8099 -23078.1256 182.242 1.53593E-4 0.12857 16.0000 0.26 hours/million states 2d 16:58:45
100000 -24121.5382 -1105.6545 -23015.8837 178.060 1.26907E-4 0.10367 17.0000 0.27 hours/million states 2d 19:28:22
120000 -23930.6897 -1105.7390 -22824.9507 187.411 1.01885E-4 0.34214 17.0000 0.27 hours/million states 2d 19:28:03
140000 -23869.4856 -1087.1915 -22782.2942 178.535 8.76375E-5 0.26128 18.0000 0.26 hours/million states 2d 16:57:48
group_references¶
group_references splits an alignment file by reference into seperate FASTQ files. group_references takes a SAM or BAM file as input, and can optionally be given an output directory where the FASTQ files will be saved. If not output directory name is provided, the files will be saved in the new folder group_references_out.
$> group_references contigs.bam
$> group_references contigs.bam --outdir split_fastqs
degen¶
Find genes where a sequence has degenerate bases.
How-to¶
- Usage:
- degen.py <fasta> <options>
Options:
--gb-id=<accession_id> Accession id for reference --gb-file=<gbfile> Local Genbank file for reference --tab-file=<tabfile> TSV/CSV file for reference with fields name,start,end
Example:¶
degen sequence.fasta --gb-id 12398.91 degen sequence.fasta --gb-file tests/testinput/sequence.gb degen sequence.fasta --tab-file tests/testinput/degen.tab degen sequence.fasta --tab-file tests/testinput/degen.csv
Output:¶
Gene name, degnerate position, degenerate base:
anchored capsid protein 85 R
anchored capsid protein 88 Y
membrane glycoprotein precursor 509 R
nonstructural protein NS5 8513 Y
nonstructural protein NS5 8514 Y
nonstructural protein NS5 8515 Y
anchored capsid protein 85 R
anchored capsid protein 88 Y
membrane glycoprotein precursor 509 R
nonstructural protein NS5 8513 Y
nonstructural protein NS5 8514 Y
nonstructural protein NS5 8515 Y
Gene/Tab File¶
degen.tab could look like:
genename start stop
foo 1 2
bar 9 33
The headers do not matter, but the start field must always come before the stop field, so the below example would also be valid:
start GENEName stop
1 foo 2
9 bar 33
or optionally without headers:
1 foo 2
9 bar 33
alternatively, with commas in place of tabs:
name,start,stop
foo,1,2
bar,9,33
You can also specify a coding region(CDS) in your file as well:
name,start,stop
CDS,3,33
foo,1,2
bar,9,33
Genbank File¶
As downloaded from NCBI’s entrez database. Use this option if you don’t have internet access.
An example
LOCUS KJ189367 10452 bp ss-RNA linear VRL 10-FEB-2014
DEFINITION Dengue virus 1 isolate DENV-1/PR/BID-V8188/2010, complete genome.
ACCESSION KJ189367
VERSION KJ189367.1 GI:582052497
DBLINK BioProject: PRJNA31235
KEYWORDS .
SOURCE Dengue virus 1
ORGANISM Dengue virus 1
Viruses; ssRNA viruses; ssRNA positive-strand viruses, no DNA
stage; Flaviviridae; Flavivirus; Dengue virus group.
REFERENCE 1 (bases 1 to 10452)
AUTHORS Zody,M.C., Newman,R.M., Henn,M., Munoz-Jordan,J., McElroy,K.L.,
Santiago,G., Poon,T.W., Charlebois,P., Weiner,B., Yang,X.,
Piper,M.E., Fitzgerald,M., McCowan,C., Young,S., Gargeya,S.,
Levin,J., Malboeuf,C., Qu,J., Ireland,A., Chapman,S.B., Murphy,C.,
Wortman,J., Nusbaum,C. and Birren,B.
CONSRTM Genome Resources in Dengue Consortium; The Broad Institute Genomics
Platform; The Broad Institute Genome Sequencing Center for
Infectious Disease; Centers for Disease Control and Prevention
Division of Vector Borne Infectious Diseases; CDC Dengue Branch
Puerto Rico
TITLE Direct Submission
JOURNAL Submitted (22-JAN-2014) Broad Institute of MIT & Harvard, 7
Cambridge Center, Cambridge, MA 02142, USA
COMMENT ##Assembly-Data-START##
Assembly Method :: Vicuna v. 1
Sequencing Technology :: Illumina
##Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..10452
/organism="Dengue virus 1"
/mol_type="genomic RNA"
/isolate="DENV-1/PR/BID-V8188/2010"
/isolation_source="cell supernatant"
/host="Homo sapiens"
/db_xref="taxon:11053"
/country="Puerto Rico"
/collection_date="2010"
/note="cell passage history: C6/36 1; cohort population:
Dengue Surveillance;
type: 1"
5'UTR 1..83
/note="indels in UTR have not been validated"
CDS 84..10262
/codon_start=1
/product="polyprotein"
/protein_id="AHI43750.1"
/db_xref="GI:582052498"
/translation="MNNQRKKTGRPSFNMLKRARNRVSTGSQLAKRFSKGLLSGQGPM
KLVMAFIAFLRFLAIPPTAGILARWSSFKKNGAIKVLRGFKKEISSMLNIMNRRKRSV
TMLLMLLPTALAFHLTTRGGEPHMIVSKQERGKSLLFKTSAGVNMCTLIAMDLGELCE
DTMTYKCPRITEAEPDDVDCWCNATDTWVTYGTCSQTGEHRREKRSVALAPHVGLGLE
TRTETWMSSEGAWKQIQRVETWALRHPGFTVIAFFLAHAIGTSITQKGIIFILLMLVT
PSMAMRCVGIGNRDFVEGLSGATWVDVVLEHGSCVTTMAKNKPTLDIELLKTEVTNPA
VLRKLCIEAKISNTTTDSRCPTQGEATLVEEQDANFVCRRTFVDRGWGNGCGLFGKGS
LLTCAKFKCVTKLEGKIVQYENLKYSVIVTVHTGDQHQVGNETTEHGTIATITPQAPT
SEIQLTDYGALTLDCSPRTGLDFNEMVLLTMKEKSWLVHKQWFLDLPLPWTSGASTSQ
ETWNRQDLLVTFKTAHAKKQEVVVLGSQEGAMHTALTGATEIQTSGTTTIFAGHLKCR
LKMDKLTLKGMSYVMCTGSFKLEKEVAETQHGTVLVQVKYEGTDAPCKIPFSTQDEKG
VTQNGRLITANPIVTDKEKPVNIETEPPFGESYIVVGAGEKALKLSWFKRGSSIGKMF
EATARGARRMAILGDTAWDFGSIGGVFTSVGKLVHQIFGTAYGVLFSGVSWTMKIGIG
ILLTWLGLNSRSTSLSMTCIVVGMVTLYLGVMVQADSGCVINWKGRELKCGSGIFVTN
EVHTWTEQYKFQADSPKRLSAAIGKAWEEGVCGIRSATRLENIMWKQISNELNHILLE
NDMKFTVVVGDANGILAQGKKMIRPQPMEHKYSWKSWGKAKIIGADIQNTTFIIDGPD
TPECPDGQRAWNIWEVEDYGFGVFTTNIWLKLRDSYTQMCDHRLMSAAIKDSKAVHAD
MGYWIESEKNETWKLARASFIEVKTCTWPKSHTLWSNGVLESEMIIPKIYGGPISQHN
YRPGYFTQTAGPWHLGKLELDFDLCEGTTVVVDEHCGNRGPSLRTTTVTGKIIHEWCC
RSCTLPPLRFRGEDGCWYGMEIRPVKEKEENLVRSMVSAGSGEVDSFSLGILCVSIMI
EEVMRSRWSRKMLMTGTLAVFLLLIMGQLTWNDLIRLCIMVGANASDRMGMGTTYLAL
MATFKMRPMFAVGLLFRRLTSREVLLLTIGLSLVASVELPNSLEELGDGLAMGIMMLK
LLTEFQPHQLWTTLLSLTFVKTTLSLDYAWKTTAMALSIVSLFPLCLSTTSQKTTWLP
VLLGSFGCKPLTMFLITENKIWGRKSWPLNEGIMAIGIVSILLSSLLKNDVPLAGPLI
AGGMLIACYVISGSSADLSLEKAAEVSWEQEAEHSGASHSILVEVQDDGTMKIKDEER
DDTLTILLKATLLAVSGVYPMSIPATLFVWYFWQKKKQRSGVLWDTPSPPEVERAVLD
NGIYRILQRGLLGRSQVGVGVFQDGVFHTMWHVTRGAVLMYQGKRLEPSWASVKKDLI
SYGGGWRFQGSWNTGEEVQVIAVEPGKNPKNVQTTPGTFKTPEGEVGAIALDFKPGTS
GSPIVNREGKIVGLYGNGVVTTSGTYVSAIAQAKASQEGPLPEIEDEVFKKRNLTIMD
LHPGSGKTRRYLPAIVREAIKRKLRTLILAPTRVVASEMAEALKGMPIRYQTTAVKSE
HTGREIVDLMCHATFTMRLLSPVRVPNYNMIIMDEAHFTDPASIAARGYISTRVGMGE
AAAIFMTATPPGSVEAFPQSNAVIQDEERDIPERSWNSGYDWITDFPGKTVWFVPSIK
SGNDIANCLRKNGKRVIQLSRKTFDTEYQKTKNNDWDYVVTTDISEMGANFRADRVID
PRRCLKPVILKDGPERVILAGPMPVTAASAAQRRGRIGRNQNKEGDQYVYMGQPLNND
EDHAHWTEAKMLLDNINTPEGIIPALFEPEREKSAAIDGEYRLRGEARKTFVELMRRG
DLPVWLSYKVASEGFQYSDRRWCFDGERNNQVLEENMDVEIWTKEGERKKLRPRWLDA
RTYSDPLALREFKEFAAGRRSVSGDLILEIGKLPQHLTLRAQNALDNLVMLHNSEQGG
KAYRHAMEELPDTIETLMLLALIAVLTGGVTLFFLSGKGLGKTSIGLLCVTASSALLW
MASVEPHWIAASIILEFFLMVLLIPEPDRQRTPQDNQLAYVVIGLLFMILTVAANEMG
LLETTKKDLGIGYVAAENHQHATMLDVDLHPASAWTLYAVATTVITPMMRHTIENTTA
NISLTAIANQAAILMGLDKGWPISKMDIGVPLLALGCYSQVNPLTLTAAVLMLVAHYA
IIGPGLQAKATREAQKRTAAGIMKNPTVDGIVAIDLDPVVYDAKFEKQLGQIMLLILC
TSQILLMRTTWALCESITLATGPLTTLWEGSPGKFWNTTIAVSMANIFRGSYLAGAGL
AFSLMKSLGGGRRGTGAQGETLGEKWKRQLNQLSKSEFNTYKRSGIMEVDRSEAKEGL
KRGETTKHAVSRGTAKLRWFVERNLVKPEGKVIDLGCGRGGWSYYCAGLKKVTEVKGY
TKGGPGHEEPIPMATYGWNLVKLHSGKDVFFMPPEKCDTLLCDIGESSPNPTIEEGRT
LRVLKMVEPWLRGNQFCIKILNPYMPSVVETLERMQRKHGGMLVRNPLSRNSTHEMYW
VSCGTGNIVSAVNMTSRMLLNRFTMAHRKPTYERDVDLGAGTRHVAVEPEVANLDIIG
QRIENIKNEHKSTWHYDEDNPYKTWAYHGSYEVKPSGSASSMVNGVVRLLTKPWDVIP
MVTQIAMTDTTPFGQQRVFKEKVDTRTPRAKRGTTQIMEVTAKWLWGFLSRNKKPRIC
TREEFTRKVRSNAAIGAVFVDENQWNSAKEAVEDERFWDLVHRERELHKQGKCATCVY
NMMGKREKKLGEFGKAKGSRAIWYMWLGARFLEFEALGFMNEDHWFSRENSLSGVEGE
GLHKLGYILRDISKIPGGNMYADDTAGWDTRVTEDDLQNEAKITDIMEPEHALLATSI
FKLTYQNKVVRVQRPAKNGTVMDVISRRDQRGSGQVGTYGLNTFTNMEVQLIRQMESE
GIFLPSELETPNLAERALDWLEKHGAERLKRMAISGDDCVVKPIDDRFATALTALNDM
GKVRKDIPQWEPSKGWNDWQQVPFCSHHFHQLIMKDGREIVVPCRNQDELVGRARVSQ
GAGWSLRETACLGKSYAQMWQLMYFHRRDLRLAANAICSAVPVDWVPTSRTTWSIHAH
HQWMTTEDMLSVWNRVWIDENPWMENKTHVSSWEEVPYLGKREDQWCGSLIGLTARAT
WATNIQVAINQVRRLIGNENYLDYMTSMKRFKNESDSEGALW"
mat_peptide 84..425
/product="anchored capsid protein"
mat_peptide 426..923
/product="membrane glycoprotein precursor"
mat_peptide 924..2408
/product="envelope protein"
mat_peptide 2409..3464
/product="nonstructural protein NS1"
mat_peptide 3465..4118
/product="nonstructural protein NS2A"
mat_peptide 4119..4508
/product="nonstructural protein NS2B"
mat_peptide 4509..6365
/product="nonstructural protein NS3"
mat_peptide 6366..6746
/product="nonstructural protein NS4A"
mat_peptide 6747..6815
/product="2K peptide"
mat_peptide 6816..7562
/product="nonstructural protein NS4B"
mat_peptide 7563..10259
/product="nonstructural protein NS5"
3'UTR 10263..10452
/note="indels in UTR have not been validated"
ORIGIN
1 catctggacc gacaagaaca gtttcgaatc ggaagcttgc ttaacgtagt tctaacagtt
61 ttttattaga gagcagatct ctgatgaaca accaacggaa aaagacgggt cgaccgtctt
121 tcaatatgct gaaacgcgcg agaaaccgcg tgtcaactgg ttcacagttg gcgaagagat
181 tctcaaaagg attgctttca ggccaaggac ccatgaaatt ggtgatggct ttcatagcat
241 ttctaagatt tctagccata cccccaacag caggaatttt ggctagatgg agctcattca
301 agaagaatgg agcaattaaa gtgttacggg gtttcaaaaa agagatctca agcatgttga
361 acataatgaa caggaggaaa agatccgtga ccatgctcct catgctgctg cccacagccc
421 tggcgtttca tttgaccaca cgagggggag agccacacat gatagttagt aagcaggaaa
481 gaggaaagtc actcttgttt aagacctctg cgggcgtcaa tatgtgcacc ctcattgcga
541 tggacttggg agagttatgt gaggacacaa tgacctacaa atgcccccgg atcactgagg
601 cggaaccaga tgacgttgac tgctggtgca atgccacaga cacatgggtg acctatggga
661 cgtgttctca aaccggcgaa caccgacgag agaaacgttc cgtggcactg gccccacacg
721 tgggacttgg tctagaaaca agaaccgaaa catggatgtc ctctgaaggc gcctggaaac
781 aaatacaaag agtggaaact tgggctttga gacacccagg attcacggtg atagcctttt
841 ttttagcaca tgctatagga acatccatca ctcagaaagg gatcattttc atcttgctga
901 tgctggtgac accatcaatg gccatgcgat gcgtgggaat aggcaacaga gacttcgttg
961 aaggactgtc aggagcaacg tgggtggacg tggtactgga gcacggaagc tgcgtcacca
1021 ccatggcaaa aaataaacca acattggaca ttgaactctt gaagacggag gtcacgaacc
1081 ctgccgtctt gcgcaaactg tgcattgaag ctaaaatatc aaacaccacc accgattcaa
1141 gatgtccaac acaaggagag gccacactgg tggaagaaca agacgcgaac tttgtgtgtc
1201 gccgaacgtt tgtggacaga ggctggggta atggctgcgg actattcgga aagggaagtc
1261 tattgacgtg tgccaagttc aagtgtgtga caaaactaga aggaaagata gttcaatatg
1321 aaaacctaaa atattcagtg atagtcactg tccacactgg ggaccagcac caggtgggaa
1381 acgagaccac agaacatgga acaattgcaa ccataacacc tcaagctccc acgtcggaaa
1441 tacagctgac cgactacgga gccctcacac tggactgctc acctagaaca gggctggact
1501 ttaatgagat ggtgctattg acaatgaaag aaaaatcatg gcttgtccac aaacaatggt
1561 ttctagactt gccactgcca tggacttcgg gggcttcaac atcccaagag acctggaaca
1621 gacaagattt gctggtcaca ttcaagacag ctcatgcaaa gaaacaggaa gtagtcgtat
1681 tgggatcaca ggaaggagca atgcatactg cgttgactgg ggcgacagaa atccagacgt
1741 caggaacgac aacaatcttc gcaggacacc tgaaatgcag actaaaaatg gataaactga
1801 ccttaaaggg gatgtcatat gtgatgtgca caggctcatt taagctagag aaggaagtgg
1861 ctgagaccca gcatggaact gttctagtgc aggtcaaata tgaaggaaca gacgcgccat
1921 gcaagatccc cttttcgacc caagatgaga aaggagtgac ccagaatggg agattgataa
1981 cagccaatcc catagttact gacaaagaaa aaccagtcaa cattgagaca gaaccacctt
2041 ttggtgagag ctacatcgtg gtaggggcag gcgaaaaagc tttgaaacta agctggttca
2101 agagaggaag cagcataggg aaaatgttcg aagcaaccgc ccgaggagca cgaaggatgg
2161 ctatcctggg agacaccgca tgggacttcg gttctatagg aggagtgttt acatctgtgg
2221 gaaaattggt acaccagatt tttggaaccg catatggggt tctgtttagc ggtgtttctt
2281 ggaccatgaa aataggaata gggattctgc tgacatggtt gggattaaat tcaaggagca
2341 cgtcactttc gatgacgtgc attgtagttg gcatggtcac actgtaccta ggagtcatgg
2401 ttcaagcgga ttcgggatgt gtgatcaact ggaagggcag agaacttaaa tgcggaagtg
2461 gcatttttgt cactaatgaa gtccacactt ggacagagca atacaaattc caggctgact
2521 ccccaaaaag actgtcagca gccattggaa aggcgtggga ggagggcgtg tgtggaattc
2581 gatcagccac gcgtcttgag aacatcatgt ggaagcagat atcaaatgaa ttgaaccaca
2641 ttttacttga gaatgacatg aaattcacag tggttgtagg agatgccaac ggaattttgg
2701 cccaaggaaa aaaaatgatt aggccacaac ccatggaaca caaatactca tggaaaagct
2761 ggggaaaagc taaaatcata ggagcagaca tacaaaatac caccttcatt atcgacggcc
2821 cagacacccc agaatgtcct gatggccaaa gagcatggaa catttgggaa gttgaggact
2881 atgggtttgg agttttcacg acaaacatat ggctgaaatt gcgtgactcc tacacccaaa
2941 tgtgtgacca ccggctaatg tcagctgcca tcaaggacag caaggcagtc catgctgaca
3001 tggggtactg gatagaaagt gaaaagaacg aaacctggaa gttggcgaga gcctccttca
3061 tagaagtcaa aacatgcacc tggccgaaat ctcacactct atggagcaat ggagttttgg
3121 aaagtgaaat gataatccca aagatatatg gaggaccaat atctcagcac aactacagac
3181 cagggtattt cacacaaaca gcagggccat ggcacctagg taagttggaa ctggattttg
3241 acttgtgtga aggcaccaca gttgttgtgg atgaacattg tggaaatcga ggtccatctc
3301 tcagaaccac aacagtcaca ggaaagataa tccatgaatg gtgttgcaga tcctgcacgc
3361 tacccccctt acgtttcaga ggagaagacg ggtgttggta tggcatggaa atcagaccag
3421 tgaaggagaa ggaggagaat ctagttaggt caatggtctc tgcagggtca ggagaagtgg
3481 acagtttttc attaggaata ctatgcgtat caataatgat tgaagaagtg atgagatcca
3541 gatggagtag aaagatgctg atgactggaa cactggctgt cttcctcctt cttataatgg
3601 gacaactgac atggaatgat ctgattaggt tatgcatcat ggtcggagct aacgcttcag
3661 acaggatggg gatgggaaca acgtacctag ccttgatggc tactttcaaa atgagaccaa
3721 tgttcgctgt agggctatta ttccgcagac taacatccag agaagttctt ctcctaacga
3781 ttggattaag cctggtggca tccgtggagc taccaaattc cttggaggag ctaggggatg
3841 gacttgcaat gggtatcatg atgttaaaat tgttgactga atttcagcca caccagttat
3901 ggaccacctt attgtctctg acatttgtca aaacaactct ctcattggat tatgcatgga
3961 aaacaacggc tatggcactg tctatcgtat ctctctttcc tttatgcctg tctacgacct
4021 cccaaaaaac aacatggctt ccggtgctgt taggatcttt tggatgcaaa ccattaacca
4081 tgtttcttat aacagaaaat aaaatctggg gaaggaaaag ttggcccctc aatgaaggaa
4141 ttatggctat tggaatagtc agcattctac taagctcact cctcaaaaat gatgtgccgt
4201 tggccgggcc attaatagct ggaggcatgc taatagcatg ttatgtcata tccggtagct
4261 cagccgattt atcattggag aaagcggctg aagtatcctg ggaacaagaa gcagaacact
4321 ccggtgcctc acacagcata ttagtagagg tccaagatga tggaactatg aaaataaaag
4381 atgaagagag ggatgacaca ctcaccatac tccttaaagc aactttgctg gcagtctcag
4441 gagtgtaccc aatgtcaata ccagcaactc tttttgtgtg gtatttttgg cagaaaaaga
4501 aacagagatc aggagtgtta tgggacacac ccagccctcc ggaagtggaa agagcagttc
4561 ttgataatgg catctataga atcttgcaaa gaggattgtt gggcaggtcc caagtaggag
4621 tgggagtttt ccaagacggc gtgttccaca caatgtggca cgttaccagg ggagctgtcc
4681 ttatgtacca agggaagaga ctggaaccaa gctgggccag tgtgaaaaag gacttgatct
4741 catatggagg aggttggagg ttccaaggat catggaacac gggagaagaa gtgcaggtaa
4801 tagctgttga accaggaaaa aaccccaaaa atgtacagac aacgccgggc acctttaaga
4861 ctcctgaagg cgaagttgga gccatagctc tagatttcaa acccggcaca tctggatctc
4921 ccatcgtgaa cagagaggga aaaatagtgg gtctgtatgg aaatggagtg gtgacaacaa
4981 gtggaaccta cgtcagtgcc attgcccaag ctaaagcatc acaggaaggg cctctaccag
5041 agattgagga cgaggtattt aagaaaagaa acttaacaat aatggacctg cacccaggat
5101 cagggaaaac aagaagatat cttccagcca tagtccgtga ggccataaaa aggaaactgc
5161 gtacgttaat cctggctccc acaagagttg tcgcctctga aatggcagag gcactcaagg
5221 gaatgccaat aagatatcag acaacagcag tgaagagtga acacacagga agggagatag
5281 ttgacctcat gtgccacgct acttttacca tgcgtctctt atccccagtg agagttccca
5341 attacaacat gatcattatg gatgaagcac attttaccga tccagctagc atagcggcca
5401 gagggtacat ctcaacccga gtgggtatgg gtgaagcagc tgcgatcttt atgacagcca
5461 ctcccccagg atcggtggag gcctttccac agagcaatgc agttatccaa gatgaggaaa
5521 gagacattcc tgagagatca tggaactcag gctacgactg gatcactgac tttccaggta
5581 aaacagtctg gtttgttcca agcattaaat caggaaatga cattgccaac tgtttaagaa
5641 agaacggaaa acgggtaatc caattgagca gaaaaacctt tgacactgag taccagaaaa
5701 caaaaaacaa tgactgggac tatgttgtca caacagacat ttctgaaatg ggggcaaatt
5761 tccgggccga cagggtaata gacccaaggc ggtgcttgaa accggtaata ctaaaagatg
5821 gtccagagcg tgtcattcta gccggaccga tgccagtgac tgcggccagt gctgcccaga
5881 ggagaggaag aattggaagg aaccaaaaca aggaaggtga tcagtatgtt tatatgggac
5941 agcctttaaa taatgatgag gatcacgctc attggacaga agcaaaaatg ctccttgaca
6001 atataaacac accagaaggg atcatcccag ccctttttga gccagagaga gaaaagagtg
6061 cagcaataga cggggagtac agactgcggg gagaagcaag gaaaacgttc gtggagctca
6121 tgagaagagg agatctacca gtttggctat cctacaaagt agcctcagaa ggtttccagt
6181 actccgacag aaggtggtgc tttgatgggg aaaggaacaa ccaggtgttg gaggagaaca
6241 tggacgtgga gatctggaca aaggaaggag aaagaaagaa attgcgacct cgctggttgg
6301 acgccagaac atactctgat ccattggccc tgcgcgagtt taaagagttc gcagcaggaa
6361 gaagaagtgt ctcaggtgac ctgatattgg aaatagggaa acttccacaa catttgacgt
6421 taagagccca gaatgctctg gacaacttgg tcatgttgca caattccgaa caaggaggaa
6481 aagcctacag acatgccatg gaggaactac cagacaccat agaaacattg atgctactag
6541 ctttgatagc tgtgttgact ggtggagtga cgctgttctt cctatcagga aaaggcctag
6601 ggaaaacatc cattggcttg ctctgtgtga cggcctcaag cgcactgtta tggatggcca
6661 gtgtggagcc ccattggata gcggcctcca tcatactaga gttctttttg atggtgctgc
6721 tcattccaga gccagacaga cagcgcactc cacaggacaa ccagctagca tatgtggtga
6781 taggtttgtt attcatgata ctgacagtgg cagccaatga gatgggatta ttggaaacca
6841 caaagaaaga cctggggatt ggctatgtag ccgccgaaaa ccaccaacat gccacaatgc
6901 tggacgtaga cctacaccca gcttcagcct ggaccctcta tgcagtagcc acaacagtca
6961 tcactcccat gatgagacac acaattgaaa atacaacggc aaacatttcc ctgaccgcca
7021 ttgcaaatca ggcagctata ttgatgggac ttgacaaggg atggccaata tcgaagatgg
7081 acataggagt tccacttctc gccttagggt gctattccca ggtgaaccca ttgacactga
7141 cagcggcggt gttgatgtta gtggctcatt atgccataat tggaccagga ctgcaagcaa
7201 aggccactag agaagcccaa aaaaggacag cagccggaat aatgaaaaat ccaaccgtag
7261 acgggattgt tgcaatagac ttggatcctg tggtttatga tgcaaaattt gaaaaacaac
7321 taggccaaat aatgttactg atactttgta catcacagat cctcttgatg cggaccacat
7381 gggccttgtg tgaatccatc acactggcta ctggacccct gaccactctc tgggagggat
7441 ctccaggaaa attctggaat accacaatag cagtgtccat ggcaaatatt ttcaggggaa
7501 gttatctagc aggagcaggt ctggctttct cattgatgaa atctttagga ggaggtagga
7561 gaggcacggg agctcaaggg gaaacactgg gagagaaatg gaaaagacag ttgaaccaac
7621 tgagcaagtc agaattcaac acctacaaaa ggagtgggat tatggaggtg gacagatccg
7681 aagccaaaga gggactgaaa agaggagaaa caaccaaaca tgcagtgtca agaggaacag
7741 ccaaactgag gtggtttgtg gagaggaacc tcgtgaaacc agaaggaaaa gtcatagacc
7801 tcggttgtgg aagaggtggc tggtcatatt attgtgctgg gctgaagaaa gttactgaag
7861 tgaagggata cacaaaagga ggacctggac atgaggaacc tatcccaatg gcgacctatg
7921 gatggaacct agtaaaacta cactctggaa aggatgtatt ttttatgcca cctgagaaat
7981 gtgacactct tctgtgtgat attggtgagt cctctccgaa tccaactata gaagaaggaa
8041 gaacgttacg tgttctaaaa atggtggaac catggctcag aggaaaccaa ttctgcataa
8101 aaatcctaaa tccttacatg ccaagtgtgg tagaaactct ggagcgaatg caaagaaaac
8161 atggagggat gctagtgcga aacccactct caagaaattc tacccatgaa atgtattggg
8221 tttcatgtgg aacaggaaac attgtgtcgg cagtgaacat gacatccaga atgttactga
8281 accgattcac aatggctcac aggaagccaa catatgaaag agacgtggac ttaggcgctg
8341 gaacaagaca tgtggcagtg gaaccagagg tagccaacct agatatcatt ggccagagga
8401 tagaaaatat aaaaaatgaa cacaagtcaa catggcatta tgatgaggac aatccataca
8461 aaacatgggc ctatcatgga tcatatgagg tcaagccatc aggatcagcc tcatctatgg
8521 tgaatggagt ggtgagattg ctcacgaaac catgggatgt catccccatg gtcacacaaa
8581 tagctatgac tgataccaca ccctttggac aacagagagt gtttaaagag aaagttgaca
8641 cgcgcacacc aagagcaaaa cgaggcacaa cacagattat ggaggtgaca gccaagtggt
8701 tatggggttt cctttccaga aacaaaaaac ccagaatctg cacaagagag gagttcacaa
8761 gaaaggttag gtcaaacgcg gcaataggag cagtgttcgt tgatgaaaac caatggaact
8821 cagcaaaaga agcagtggaa gacgaaaggt tttgggatct tgtgcacaga gagagggagc
8881 ttcataaaca gggaaaatgt gccacgtgtg tctacaacat gatggggaag agagagaaaa
8941 aattaggaga gtttggaaag gcaaaaggaa gtcgtgcaat atggtacatg tggctgggag
9001 cacgctttct ggagttcgaa gcccttggtt ttatgaatga agatcactgg tttagtagag
9061 agaattcact cagtggagtg gaaggagaag gactgcacaa acttggatac atactcagag
9121 acatatcaaa gattccgggg ggaaatatgt atgcagatga tacagccgga tgggacacaa
9181 gagtaacaga ggatgacctc cagaatgagg ctaaaatcac tgacatcatg gagcctgaac
9241 atgctctatt ggctacgtca atttttaagc tgacttatca aaacaaggtg gtgagggtgc
9301 aaagaccagc aaaaaatgga accgtgatgg atgttatatc cagacgtgat cagagaggga
9361 gtggacaggt cggaacttat ggcttaaata ctttcaccaa tatggaggtc caactaataa
9421 gacaaatgga gtctgaggga atctttttac ccagcgaatt ggaaaccccc aacctagctg
9481 agagggctct tgactggtta gaaaaacatg gcgccgaaag gctgaaacga atggcaatca
9541 gcggagatga ttgcgtggtg aaaccaattg acgacaggtt cgcaacagcc ttaacagctc
9601 tgaatgacat gggaaaagta aggaaagaca taccgcagtg ggaaccttca aaaggatgga
9661 atgattggca gcaagtgcct ttttgttcac accatttcca ccaactgatc atgaaggatg
9721 ggagggaaat agtggtgcca tgccgcaacc aagatgaact tgtgggcagg gctagagtat
9781 cacaaggcgc cggatggagc ctgagagaaa ctgcttgcct aggcaagtca tatgcacaaa
9841 tgtggcagct gatgtacttc cacaggagag acctgagact agcggctaac gctatctgtt
9901 cagccgtccc agttgattgg gtcccaacca gccgcacaac ctggtcaatc catgcccacc
9961 accaatggat gacaacagaa gacatgttat cagtgtggaa tagggtttgg atagacgaaa
10021 acccatggat ggagaacaaa actcatgtat ccagttggga agaagttcca tacctaggaa
10081 aaagggaaga tcaatggtgt ggatccctga taggcttgac agcgagggcc acctgggcca
10141 ccaacataca agtagccata aaccaagtga gaaggctcat cgggaatgag aattatttag
10201 attacatgac atcaatgaag agattcaaga atgagagtga ttccgaagga gcactctggt
10261 aagtcaacac actcatgaaa taaaggaaaa tagaagatca aacaaagtaa gaagtcaggc
10321 cagattaagc catagcacgg aaagagctat gctgcctgtg agccccgtcc aaggacgtaa
10381 aatgaagtca ggccgaaagc cacggattga gcaagccgtg ctgcctgtgg ctccatcgtg
10441 gggatgtagc tc
//
parallel_blast¶
Parallel blast is a wrapper script around the blast commands as well as diamond. It utilizes GNU Parallel to run the commands in parallel by splitting up the input fasta files and distributes them across multiple subprocesses. If it detects that it is running inside of a PBS or SGE job it will run the job on multiple hosts that may be allocated to the job.
parallel_blast requires that you have gnu parallel installed and in your environments PATH as well as diamond and/or blastn/blastx/blastp.
Usage¶
You can get all the arguments that can be supplied via the following
$> parallel_blast --help
Examples¶
For the examles below assume you have an input fasta in the current directory
called input.fasta
Running blastn¶
$> parallel_blast input.fasta output.blast --ninst 4 --db /path/to/nt \
--blast_exe blastn --task megablast --blast_options "--evalue 0.01"
[cmd] /path/to/parallel -u --pipe --block 10 --recstart > --sshlogin 4/: /path/to/blastn -task megablast -db /path/to/nt -max_target_seqs 10 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -query -
Notice how we had to quote the additional --blast_options
Running diamond¶
Diamond v0.7.9 is the version that was tested with parallel_blast. As diamond is still in development the options may change in future versions and parallel_blast may not run them correctly. Please submit a new issue if you find any issues.
$> parallel_blast input.fasta out.blast --ninst 4 --db /path/to/diamondnr \
--blast_exe diamond --task blastx --blast_options "--tmpdir dtmp"
[cmd] /path/to/parallel -u --pipe --block 10 --recstart > --cat --sshlogin 1/: /path/to/diamond blastx --threads 4 --db /path/to/diamondnr --query {} --compress 0 -a out.blast
Notice how even though we specified --ninst 4
that --sshlogin 1/:
was used
and --threads 4
was set instead.
Note In recent versions of diamond, diamond outputs a daa binary file instead of a tab separated file. parallel_blast automatically converts the diamond output from daa to tab format for you but leaves the daa file behind(Same name as the output file you specify, but with the extension .daa)
Command that is run¶
You will notice in the examples above that when you run parallel_blast that it outputs the command that it is running in case you want to copy/paste it and run it yourself sometime.
You might notice that the command does not include all the quoted arguments such
as the --recstart
argument which should be --recstart ">"
as well as
the --outfmt
which should be quoted as --outfmt "6 ..."
. If you intend on
rerunning the command you will have to add the quotes manually.
Running inside of a PBS or SGE Job¶
parallel_blast is able to detect if it is running inside of a PBS or SGE job by
looking to see if PBS_NODEFILE
or PE_HOSTFILE
is set in the environment’s
variables.
If it finds either of them it will run the job by supplying --sshlogin
for each
host it finds in the file.
PBS_NODEFILE
and PE_HOSTFILE
have different syntax so parallel_blast first
builds a CPU,NODENAME list from them.
PBS_NODEFILE¶
This file is parsed and counts how many of each unique host is listed such that the following PBS_NODEFILE:
node1.localhost
node2.localhost
node2.localhost
node3.localhost
node3.localhost
node3.localhost
would run 1 instance on node1.localhost, 2 instances on node2.localhost and 3 instances on node3.localhost
PE_HOSTFILE¶
This file is almost in the exact syntax that parallel_blast uses so it is almost a 1-to-1 mapping.
Diamond and multiple hosts¶
Since diamond utilizes threads much more efficiently than blast, for each unique
host in a job only 1 instance is launched but the -p
option is set to the number
of CPUS for each host listed in the PE_HOSTFILE
or PBS_NODEFILE
degen_regions¶
Finds all degenerate bases in a given fasta input file that may contain multiple sequeces and reports their position as well as the annotated gene name that contains them.
The fasta file must be previously aligned to the query sequence. That is, if you are using a genbank annotation file or having the script download it for you, you should have aligned all your input sequences to that sequence.
The annotation is retrieved via supplied genbank accession, genbank file path or gene tab/csv file.
Using Genbank Files¶
If you already have downloaded the genbank annotation file(typically the extension is .gb) you can use the –gb-file argument
The following will use the test input fasta file as well as the test input genbank file to find all degenerate bases and will put the output in a tab separated file called output.tsv
degen_regions -i tests/Den4_MAAPS_TestData16.fasta -o output.tsv --gb-file tests/testinput/sequence.gb
Fetching Genbank Files Automatically¶
If you want the script to automatically fetch the Genbank annotation file from the internet you can use the –gb-id option and specify an accession number.
degen_regions -i tests/Den4_MAAPS_TestData16.fasta -o output.tsv --gb-id KJ189367
Using tab/csv file of gene annotation info¶
If you have a tab/csv file of gene annotations you can supply that using the –tab-file argument
You can read more about the format of the tab/csv annotation file in the degen docs
degen_regions -i tests/Den4_MAAPS_TestData16.fasta -o output.tsv --gb-file tests/testinput/sequence.gb
Manually specify CDS¶
You can use the --cds
argument to set the coding region.
This argument should be comma separated such as start,stop
.
Specifying this argument will override any other cds found in the tab file, genbank
file or fetched genbank file.
The following would mark all locations as NON-CODING as you are specifying that only position 1 is coding
degen_regions -i tests/Den4_MAAPS_TestData16.fasta -o output.tsv --gb-file tests/testinput/sequence.gb --cds 1,1
Output¶
The output is a simple tab separated file
seq id nt Position aa position nt composition aa composition gene name
----------------------------------------- ------------- ------------- ---------------- ---------------- -------------------------------
721 991 331 WCA S/T envelope protein
721 1307 436 AYA I/T envelope protein
721 1826 609 AYA I/T envelope protein
721 1865 622 GRA E/G envelope protein
721 7766 2589 ARA K/R nonstructural protein NS5
2055_Den4/AY618992_1/Thailand/2001/Den4_1 1927 643 RAC D/N envelope protein
2055_Den4/AY618992_1/Thailand/2001/Den4_1 2833 945 YCG P/S nonstructural protein NS1
2055_Den4/AY618992_1/Thailand/2001/Den4_1 3565 1189 YAT H/Y nonstructural protein NS2A
2055_Den4/AY618992_1/Thailand/2001/Den4_1 6271 2091 RAA E/K nonstructural protein NS3
2055_Den4/AY618992_1/Thailand/2001/Den4_1 8656 2886 YAT H/Y nonstructural protein NS5
2055_Den4/AY618992_1/Thailand/2001/Den4_1 8998 3000 YAG */Q nonstructural protein NS5
2055_Den4/AY618992_1/Thailand/2001/Den4_1 9811 3271 YCC P/S nonstructural protein NS5
2055_Den4/AY618992_1/Thailand/2001/Den4_1 10542 3515 AGN NON-CODING -
2055_Den4/AY618992_1/Thailand/2001/Den4_1 10543 3515 NNN NON-CODING -
2055_Den4/AY618992_1/Thailand/2001/Den4_1 10541 3514 NNN NON-CODING -
2055_Den4/AY618992_1/Thailand/2001/Den4_1 10539 3514 NNN NON-CODING -
2055_Den4/AY618992_1/Thailand/2001/Den4_1 10546 3516 NNN NON-CODING -
2055_Den4/AY618992_1/Thailand/2001/Den4_1 10544 3515 NNN NON-CODING -
2055_Den4/AY618992_1/Thailand/2001/Den4_1 10542 3515 NNN NON-CODING -
1942_Den4/AY618992_1/Thailand/2001/Den4_1 4540 1514 RTA I/V nonstructural protein NS3
1942_Den4/AY618992_1/Thailand/2001/Den4_1 10177 3393 MCA P/T nonstructural protein NS5
1942_Den4/AY618992_1/Thailand/2001/Den4_1 10546 3516 NNN NON-CODING -
1942_Den4/AY618992_1/Thailand/2001/Den4_1 10544 3515 NNN NON-CODING -
1942_Den4/AY618992_1/Thailand/2001/Den4_1 10542 3515 NNN NON-CODING -
1875_Den4/AY618992_1/Thailand/2001/Den4_1 1514 505 AYG M/T envelope protein
1875_Den4/AY618992_1/Thailand/2001/Den4_1 3056 1019 ARA K/R nonstructural protein NS1
1875_Den4/AY618992_1/Thailand/2001/Den4_1 3058 1020 KCA A/S nonstructural protein NS1
1875_Den4/AY618992_1/Thailand/2001/Den4_1 3073 1025 WTT F/I nonstructural protein NS1
1875_Den4/AY618992_1/Thailand/2001/Den4_1 3491 1164 AYC I/T nonstructural protein NS2A
1875_Den4/AY618992_1/Thailand/2001/Den4_1 3895 1299 RTG M/V nonstructural protein NS2A
1875_Den4/AY618992_1/Thailand/2001/Den4_1 7445 2482 GYA A/V nonstructural protein NS4B
948_Den4/AY618992_1/Thailand/2001/Den4_1 2819 940 ARC N/S nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 2947 983 RCC A/T nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3058 1020 KCA A/S nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3073 1025 WTT F/I nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3116 1039 GYG A/V nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3181 1061 RTW I/V nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3179 1060 RTW I/V nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3338 1113 ART N/S nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3362 1121 ARA K/R nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3373 1125 WCR S/T nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 3371 1124 WCR S/T nonstructural protein NS1
871_Den4/AY618992_1/Thailand/2001/Den4_1 4314 1439 ATV I/M nonstructural protein NS2B
871_Den4/AY618992_1/Thailand/2001/Den4_1 7045 2349 WCC S/T nonstructural protein NS4B
871_Den4/AY618992_1/Thailand/2001/Den4_1 10536 3513 GAW NON-CODING -
871_Den4/AY618992_1/Thailand/2001/Den4_1 10537 3513 YCA NON-CODING -
947_Den4/AY618992_1/Thailand/2001/Den4_1 2971 991 YTY F/L nonstructural protein NS1
947_Den4/AY618992_1/Thailand/2001/Den4_1 2969 990 YTY F/L nonstructural protein NS1
947_Den4/AY618992_1/Thailand/2001/Den4_1 6763 2255 YTT F/L 2K peptide
1793_Den4/AY618992_1/Thailand/2001/Den4_1 223 75 MAG K/Q anchored capsid protein
1793_Den4/AY618992_1/Thailand/2001/Den4_1 556 186 RCC A/T membrane glycoprotein precursor
1793_Den4/AY618992_1/Thailand/2001/Den4_1 586 196 RGT G/S membrane glycoprotein precursor
1793_Den4/AY618992_1/Thailand/2001/Den4_1 613 205 YCA P/S membrane glycoprotein precursor
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2875 959 YCG P/S nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2943 982 AAN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2944 982 NNG GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2942 981 NNG GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2976 993 ATN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2977 993 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2975 992 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2973 992 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2980 994 NTG GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2987 996 ANN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2986 996 ANN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2989 997 NGT GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2996 999 TNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2995 999 TNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3001 1001 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2999 1000 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 2997 1000 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3004 1002 NCC GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3073 1025 NTT GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3086 1029 ARC N/S nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3095 1032 CNG GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3116 1039 GNG GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3144 1049 GAN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3159 1054 GAN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3160 1054 NNC GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3158 1053 NNC GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3206 1069 GNC GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3235 1079 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3233 1078 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3231 1078 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3238 1080 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3236 1079 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3234 1079 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3241 1081 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3239 1080 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3237 1080 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3244 1082 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3242 1081 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3240 1081 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3247 1083 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3245 1082 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3243 1082 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3250 1084 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3248 1083 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3246 1083 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3253 1085 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3251 1084 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3249 1084 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3256 1086 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3254 1085 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3252 1085 NNN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3316 1106 NGG GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3337 1113 NAT GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3341 1114 GNA GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3408 1137 ATN GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3412 1138 NTG GAPFOUND nonstructural protein NS1
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3493 1165 MCC P/T nonstructural protein NS2A
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3509 1170 ANT GAPFOUND nonstructural protein NS2A
1793_Den4/AY618992_1/Thailand/2001/Den4_1 3837 1280 TTN GAPFOUND nonstructural protein NS2A
1793_Den4/AY618992_1/Thailand/2001/Den4_1 6185 2062 ARG K/R nonstructural protein NS3
1793_Den4/AY618992_1/Thailand/2001/Den4_1 6187 2063 RAR E/K nonstructural protein NS3
1793_Den4/AY618992_1/Thailand/2001/Den4_1 6185 2062 RAR E/K nonstructural protein NS3
1793_Den4/AY618992_1/Thailand/2001/Den4_1 6614 2205 TYT F/S nonstructural protein NS4A
1793_Den4/AY618992_1/Thailand/2001/Den4_1 6650 2217 ARA K/R nonstructural protein NS4A
1793_Den4/AY618992_1/Thailand/2001/Den4_1 8630 2877 ART N/S nonstructural protein NS5
1793_Den4/AY618992_1/Thailand/2001/Den4_1 8844 2949 AAN GAPFOUND nonstructural protein NS5
1793_Den4/AY618992_1/Thailand/2001/Den4_1 9938 3313 AYT I/T nonstructural protein NS5
1793_Den4/AY618992_1/Thailand/2001/Den4_1 9941 3314 GRC D/G nonstructural protein NS5
1793_Den4/AY618992_1/Thailand/2001/Den4_1 10015 3339 RTT I/V nonstructural protein NS5
1793_Den4/AY618992_1/Thailand/2001/Den4_1 10087 3363 NGR GAPFOUND nonstructural protein NS5
1793_Den4/AY618992_1/Thailand/2001/Den4_1 10085 3362 NGR GAPFOUND nonstructural protein NS5
1901_Den4/AY618992_1/Thailand/2001/Den4_1 15 6 AAN NON-CODING 5'UTR
1901_Den4/AY618992_1/Thailand/2001/Den4_1 111 38 TTN GAPFOUND anchored capsid protein
1901_Den4/AY618992_1/Thailand/2001/Den4_1 2279 760 GYT A/V envelope protein
1901_Den4/AY618992_1/Thailand/2001/Den4_1 8798 2933 ARA K/R nonstructural protein NS5
1901_Den4/AY618992_1/Thailand/2001/Den4_1 10195 3399 RAG E/K nonstructural protein NS5
1901_Den4/AY618992_1/Thailand/2001/Den4_1 10366 3456 RGG NON-CODING 3'UTR
1934_Den4/AY618992_1/Thailand/2001/Den4_1 15 6 AAN NON-CODING 5'UTR
1934_Den4/AY618992_1/Thailand/2001/Den4_1 111 38 TTN GAPFOUND anchored capsid protein
1934_Den4/AY618992_1/Thailand/2001/Den4_1 998 333 GMT A/D envelope protein
1934_Den4/AY618992_1/Thailand/2001/Den4_1 4515 1506 TTM F/L nonstructural protein NS3
1934_Den4/AY618992_1/Thailand/2001/Den4_1 8798 2933 ARA K/R nonstructural protein NS5
plot_muts¶
Example¶
plot_muts --refs tests/testinput/refs.fas --query tests/testinput/query.fas --out plot.png
The --out
option is optional. If it is not provided, the plot will pop up on
the user’s screen automatically. If this does not work, try saving the image using --out
instead.
Example Output¶
Input File Requirements¶
The input must be fasta format. Both the query and ref files can have any number of sequences.
The year should be the last part of the ID, preceded by a quadruple underscore. e.g.:
>some|info|blah_blah____2001_09_2010
>some____1995
>some____09/09/2012
If the ID uses ‘/’ rather than underscore, plot_muts currently accepts the year as the fourth field. e.g.:
>some/info/blah/1995
>some/info/blah/1995/more/info
fasta¶
fasta is a very simple script to help mangle fasta files. Currently it only supports the ability to convert fasta files that have sequences that span multiple new lines into single lines.
Later on, it may be expanded more to include even more useful fasta features.
Usage¶
fasta --help
Examples¶
The following examples all use the test fasta file found under
tests/testinput/col.fasta
>sequence1 some description !@#$%^&*()_+-=[]{}.,></?';:"
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>sequence2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
Read fasta input from standard input¶
The following could output the fasta sequences as one line to your terminal(stdout) but reading from the pipe. This is useful if you want to use it in a pipeline.
$> cat tests/testinput/col.fasta | fasta -
Read fasta input from file¶
The following could output the fasta sequences as one line to your terminal(stdout) as well, but reading straight from the file.
$> fasta tests/testinput/col.fasta
Simple shell pipeline using fasta¶
The following is a simple shell pipeline to count how many A’s there are in the
sequence lines. There should be 160 since col.fasta
is 80 characters per line
and only the first line of each sequence has A
and there are 2 sequences.
$> fasta tests/testinput/col.fasta | grep -v '>' | grep -Eo '[Aa]' | wc -l
160
AMOS¶
AMOS is a file format that is similar to any assembly file format such as ACE or SAM. It contains information about each read that is used to assemble each contig.
The format is broken into different message blocks. For the Ray assembler, it produces an AMOS file that is broken into 3 types of message blocks
RED
{RED iid:\d+ eid:\d+ seq: [ATGC]+ . qlt: [A-Z]+ }
- iid
Integer identifier
- eid
Same as iid?
- seq
Sequence data
- qlt
Should be quality, but is only a series of D’s from Ray assembler
TLE
{TLE src:\d+ off:\d+ clr:\d+,\d+ }
- src
RED iid that was used
- off
One would think offset, but unsure what it actually means
- clr
Not sure what this is either
CTG
{CTG iid:\d+ eid:\w+ com: .*$ . seq: [ATGC]+ . qlt: [A-Z]+ . {TLE ... } }
- iid
integer id of contig
- eid
contig name
- com
Communication software that generated this contig
- seq
Contig sequence data
- qlt
Supposed to be contig quality data, but for Ray it only produces D’s
- TLE
0 or more TLE blocks that represent RED sequences that compose the contig
Parsing¶
bio_bits contains an interface to parse a given file handle that has been opened on an AMOS file.
To read in the AMOS file you simply do the following
from bio_bits import amos
a = None
with open('AMOS.afg') as fh:
a = amos.AMOS(fh)
CTG¶
To get information about the contigs(CTG) you can access the .ctgs
attribute.
The contigs are indexed based on their iid so to get the sequence of contig iid 1
you would do the following:
ctg = a.ctgs[1]
seq = ctg.seq
To retrieve all the reads(RED) that belong to a specific contig:
reads = []
for tle in ctg.tlelist:
reads.append(a.reds[tle.src])
RED¶
To get information about the reads(RED) you can access the .reds
attribute.
The reds are indexed based on their iid so to get the sequence of red iid 1 you
would do the following:
red = a.reds[1]
seq = red.seq
If you want to convert a RED entry into anything you can use the .format
method. The .format
method allows you to utilize any of the properties of
a RED object such as .iid
, .eid
, .seq
, .qlt
. You can see in
the examples below how to do this.
Examples¶
Here is an example of how to convert all RED blocks into a single fastq file
from bio_bits import amos
# Fastq format string
fastq_fmt = '@{iid}\n{seq}\n+\n{qlt}'
with open('amos.fastq','w') as fh_out:
with open('AMOS.afg') as fh_in:
for iid, red in amos.AMOS(fh_in).reds.items():
fq = red.format(fastq_fmt)
fh_out.write(fq + '\n')
CHANGELOG¶
Version 1.3.0¶
- Added fasta script that removes newlines from fasta sequences
Version 1.2.1¶
- Fixed some python3 and python2.6 incompatability issues
- Fixed some old bio_pieces references
- Added some simple tests for plot_muts
Version 1.2.0¶
- Renamed project to bio_bits to fix naming issue with other project
- GPL License added
- degen_regions script added
- parallel_blast added
- plot_muts script added
Version 1.1.0¶
- Renamed parse_contigs to group_references to better name functionality
- group_references now supports bam files
Version 1.0.0¶
- Version bump. Starting here we will employ semantic versioning
- Added version script to get version from project
Version 0.1.0¶
- Started project over to setup for Continuous Integration testing
- Added rename_fasta that can rename fasta sequence identifiers based on a input rename file
- Added travis, coveralls, readthedocs
- Added amos file parser that is specific to Ray assembler amos format
- Added format functionality for amos classes such that it is easy to convert to different formats
- Added amos2fastq to pull sequences out of AMOS files organized by their contigs.
- Added vcfcat.py, a commandline app for filtering and comparing vcf files.
- Completed documentation for vcfcat
- Added beast_checkpoint script and documentation
- Added beast_wrapper script that prints estimated time column in beast output
- Added beast_est_time script that allows you to easily get estimated time left from already running beast run
TODO¶
Todo
Could be possible to get beast_checkpoint to check for that scenario and use the last tree state that matches the last log state
(The original entry is located in /home/docs/checkouts/readthedocs.org/user_builds/bio-bits/checkouts/v1.3.0/docs/scripts/beast_checkpoint.rst, line 52.)