Welcome to the documentation for SMCSMC!¶
This is the reference manual for SMCSMC (or smcsmc
), a method and program for inferring ancestral population size and migration from whole genome sequences.
There are two versions of this documentation. The stable version describes functionality of the most recent conda
release, while the latest version documents new features in the development branch.
Introduction¶
SMCSMC, short for Sequential Monte Carlo for the Sequentially Markovian Coalescent, is a method for estimating ancestral population size and migration history from sequence data. It has several advantages over comparable methods, especially when you are interested in analysing complex demographic models.
The method uses a particle filter to sample from the posterior distribution of trees along the sequence and Variational Bayes to infer epoch specific demographic parameters over a given number of iterations.
smcsmc
takes as input optionally phased sequencing data formatted as segments, and provides utilities for analysing and visualising the inferred ancestral rates.
Installation¶
We highly recommend installing SMCSMC from conda
, as it comes packaged with all necessary dependencies. A seperate guide for manual compilation may be found in the developer reference. See here for a helpful guide to installing and using conda
to manage programs.
First add both conda-forge
and terhorst
to your channel lists (if they are not there already), then install smcsmc
.
conda config --add channels conda-forge
conda config --add channels terhorst
conda install -c luntergroup smcsmc
Basic Usage¶
smcsmc
can be run on the command line with smc2
or through the python API.
smc2 -h
Follow the Getting Started guide to become familiar with the basic structure and function of SMCSMC commands, then look at one of the tutorials for analysing simulated or real data. For a more complete guide to arguments, see Input Arguments.
Other Methods¶
SMCSMC is part of the PopSim consortium, and we are actively involved in building a framework to standardize population genetic analyses. Part of this involves making it easy to run the same analysis with many different methods. We have built smcsmc
with this goal in mind. For the latest information about comparisons between different population genetic software, including smc++
, stairwayplot
, msmc
, and dadi/fastcoal
, check out the PopSim analysis repository.
Citation¶
If you use smcsmc
in your work, please cite the following article:
- Henderson, D., Zhu, S. (Joe), & Lunter, G. (2018). Demographic inference using particle filters for continuous Markov jump processes. BioRxiv, 382218. https://doi.org/10.1101/382218
- Staab, P. R., Zhu, S., Metzler, D., & Lunter, G. (2015). scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10), 1680–1682. https://doi.org/10.1093/bioinformatics/btu861
Getting Started¶
This guide will familiarise your with the concepts necessary to run analyses in SMCSMC and take you through the basic aspects of an analysis using toy data. You may also be interested in one of several examples using real and simulated data in the Tutorials section.
Note
This tutorial is intended to be run on a personal computer, and values have been scaled accordingly. For analysis on real data, we expect that a user will have access to a compute cluster, and we provide guidance about setting up smcsmc
to function with common architechtures in Clustering Computing.
Basic Concepts¶
SMCSMC is a particle filter, which means that a given number of particles are simulated, evaluated for an approximate likelihood, and resampled until demographic parameters have converged. Many of the options relate to the behaviour of this particle filter, and the remainder deal with the demographic model that you wish to infer. SMCSMC can infer complex demographic models involving several populations, though you should be weary of overspecifying your model. Run time increases drastically with the number of parameters needed to specify the model.
Input Format¶
Suppose we have the following data, saved into a file named toy.seg
. You can also download it here.
segment start | segment length | invariant | break | chr | genotype |
---|---|---|---|---|---|
1 | 521 | T | F | 1 | 01.0 |
522 | 2721 | T | F | 1 | 0111 |
3243 | 1758 | T | F | 1 | 10// |
5001 | 1296 | T | F | 1 | 0000 |
6297 | 1 | T | F | 1 | …. |
6298 | 4669 | T | F | 1 | 0110 |
10967 | 880 | T | T | 1 | 0100 |
1 | 708 | T | F | 2 | 1010 |
This gives the state of four haplotypes along the sequence, indicating segment starts and their lengths. If a segment is the last in a block before coordinates reset (i.e. when including multiple chromosomes in a single seg
file, it is called a genetic break. Genotypes in terms of either major/minor or ancestral/derived are given for each haplotype. Missing data is coded with a period (.
) whilst unphased variants are given with a forward slash (/
).
Note
This data format is similar to the input for msmc however, in the second column we require the length from the current segment to the next, rather than the number of called sites from the previous segment to the current one.
Segment files are specified by the seg
flag, or if more than one are used, by the segs
flag. Globs are encouraged for readability. In the case of multiple files, genetic breaks are inferred from the beginnings and ends of files. Additionally, we tell smcsmc
the number of samples we expect with nsam
, so it knows to check the seg
file for formatting issues.
Basic Arguements¶
In addition to the segment file, we need a few more pieces of information to kick off the particle filter.
Particle Count: We must specify a particle count (Np
) or use the default value of 1000. The number of particles refers to the number of individual ARGs which are simulated by SCRM
, higher numbers means that the model is more likely to converge on a reasonable answer, but increasing the particle count is computationally expensive. We recommend particle counts between 25 and 50 thousand for analysis. However, for exploratory work, smaller particle counts (a good starting point is 10 thousand) can be used to more effectively use computational resources. In this guide, intended for use on a personal computer, we will use 10 particles.
Number of Iterations: We also explictly define the number of expectation maximization iterations that we want to use (EM
). A good place to start is 15 iterations. smcsmc
looks for output before starting, so should you be unsatisfied with the convergence after 15 iterations, simply run the exact same program with a higher number of iterations. smcsmc
will find the output for the iterations already run and simply continue from where it left off.
Demographic Parameters: Several arguements are good to specify, especially when analysing real data as they help with convergence. Here we will use a mutation rate (mu
) of 1.25e-8
and a recombination rate (rho
) of 3e-9
. We give an effective population size (N0
) of 14312, and infer trees back to 4*N0
generations with tmax
. We set this to 3.5, which is 1.2 Mya with an N0
of 10000. smcsmc
infers demographic parameters as discrete over intervals. We specify 31 intervals evenly spaced on the log scale with P 133 133032 31*1
giving the end times of the first and last epochs in generations, and the pattern for their generation.
We also give the path to the output folder with o
.
Running smcsmc
¶
Tip
It is always a good idea to start a new conda
environment for a new analysis to ensure that there are no dependecy conflicts, though you can skip this step if you certain that your environment is correctly configured.
conda create --name smcsmc_tutorial
conda activate smcsmc_tutorial
Once smcsmc
is installed, we can format the arguements detailed above into a dictionary.
import os
args = {
'seg': 'toy.seg',
'nsam': '4',
'Np': '10',
'EM': '1',
'mu': '1.25e-8',
'rho': '3e-9',
'N0': '10000',
'tmax': '3.5',
'P': '133 133032 31*1',
'no_infer_recomb': '',
'smcsmcpath': os.path.expandvars('${CONDA_PREFIX}/bin/smcsmc'),
# The minimum segment length is usually best left
# as the default, however for this tutorial we set it
# artifically low.
'minseg': '10',
'o': 'smcsmc_output'
}
We directly use this dictionary with the run_smcsmc
command, which takes as its only arguement a dictionary of arguements.
import smcsmc
smcsmc.run_smcsmc(args)
If your installation has been successful, then this will begin the process of parsing the input, merging any given seg
files, starting the particle filter, and iterating through the EM
steps requested.
Output¶
If your smcsmc
has run correctly, the resulting output directory will look something like this, with a seperate folder for each EM iteration, and a seperate file for each chunk, if this option has been used.
output/
emiterN/
chunkN.out
chunkN.stdout
chunk.stderr
chunkfinal.out
merged.seg
merged.map
result.log
result.out
If you are following this tutorial and are only using a single input seg
file, you will not see merged.seg
or merged.map
as there was no need to generate them. The output for each chunk is given, along with stdout and sterr, and results are aggregated over all chunks each epoch into chunkfinal.out
. The final epoch will be post processed into result.out
. Output and debugging information along with useful information to help interpret the results of your model are given in result.log
.
An example of a results.out file is given here:
Iter Epoch Start End Type From To Opp Count Rate Ne ESS Clump
15 0 0 133 Coal 0 -1 307466.77 18.798649 6.1140424e-05 8177.8955 1.0272 -1
15 0 0 133 Coal 1 -1 324087.66 55.469939 0.00017115721 2921.2909 1.0225 -1
15 1 133 166.2 Coal 0 -1 167571.8 1.0002328 5.9689802e-06 83766.402 1.0791 -1
15 1 133 166.2 Coal 1 -1 174231.23 0.067992116 3.9024069e-07 1281260.5 1.0674 -1
15 2 166.2 207.68 Coal 0 -1 259801.05 40.479172 0.00015580835 3209.0707 1.1686 -1
15 2 166.2 207.68 Coal 1 -1 267803.96 0.00028150086 1.0511452e-09 4.7567166e+08 1.1522 -1
15 -1 0 1e+99 Delay -1 -1 2.6001037e+09 0 0 0 1 -1
15 -1 0 1e+99 LogL -1 -1 1 -31005174 -31005174 0 1 -1
15 0 0 133 Migr 0 1 178112.7 7.1251662 4.0003695e-05 0 1.0385 -1
15 0 0 133 Migr 1 0 186343.02 0.00022362676 1.2000812e-09 0 1.0338 -1
15 1 133 166.2 Migr 0 1 87712.37 0.5934193 6.7655144e-06 0 1.0819 -1
15 1 133 166.2 Migr 1 0 91253.03 3.0913275 3.3876437e-05 0 1.0713 -1
15 2 166.2 207.68 Migr 0 1 134755.98 1.0135379 7.5212833e-06 0 1.1711 -1
15 2 166.2 207.68 Migr 1 0 139267.34 6.0189739 4.3218847e-05 0 1.1565 -1
The full interpretation of the output is somewhat involved, and we recommend referring to the supporting publication for a complete explaination of all relevant quantities. However, in brief:
- Iter: The EM iteration to which this file refers. In the case of aggregated files (
chunkfinal.out
,result.out
), multiple iterations are represented in the same file, and it is often useful either to graph each iteration seperately to show convergence, or to select the last iteration to show the final inference. - Epoch: The discrete piece of time in which rates are assumed to be constant. The pattern which epochs follow is set by the
P
flag. - Start: The beginning of the epoch, in generations.
- End: The end of the poch, in generations.
- Type: Either
Coal
(coalescence),Delay
,LogL
(Log likelihood), orMigr
. The important lines for interpreting output are:Coal
: This line refers to coalescent events. ForCoal
events, theFrom
refers to the population of interest while theTo
is meaningless. An estimate ofNe
is shown only inCoal
lines.Migr
: This line refers to migration events.From
andTo
refer to the source and the sink of migration, backwards in time.
- Opp: The possible time where events of this can may happen in this sequence. For recombination inference, this refers to the amount of time multiplied by the amount of sequence.
- Count: The number of events of this type inferred in this epoch.
- Rate: Either the coalescent, recombination, or migration rate. This is equal to the
Count
divided byOpp
. Coalescent rates are number of coalescentn events per generation. Migration rates represent proportion migration from the source (From
) to the sink (To
) backwards in time.
Note
SCRM
reports migration rates forwards in time while smcsmc
reports it backwards in time.
- Ne: The estimated population size in this epoch. Defined to be
Opp
divded by2*Count
. - ESS: Effective sample size, which is an indicator of the diversity of our particles. This is used when deciding to resample, or when to spawn new particles.
- Clump: This is defined in aggregated output, and not defined in intermediary files.
Important diagnostic information about the reliability of inferred rates is contained in the results file. For instance, it is often useful to look at the amount of events inferred for each epoch. In general, it should monotonically increase, though this will depend on the definition of your epochs. Can you spot the oddity above?
Very low counts in a particular epoch can be due to any number of reasons, but a good first step is to increase the number of particles.
Visualising Output¶
We provide several functions for visualising different aspects of the output.
It is a good idea to first check the convergence of your model by plotting each iteration seperately in what we refer to as a rainbow plot. Specify the path to your output result.out
file and the path to a preferred file path to store the image.
result_path = 'output/result.out'
plot_path = 'output/rainbow.png'
smcsmc.plot_migration(result_path, plot_path)
An example plot is shown below, note that we have additionally specified a model from which this data was simulated. For details, please refer to the tutorials.

Estimated population size over 15 EM iterations each plotted in a different colour. The model appears to have well converged to a solution.
Plotting migration and effective population size can be done similarly with smcsmc.plot_migration
and smcsmc.plot_ne. See the API documentation for more information.
General Usage¶
Explore this section to learn more about how to use various aspects of smcsmc
.
Input Arguments¶
This section describes the general structure of smcsmc
arguments and provides information on how they are interpreted.
General Structure¶
The user has a single entry point into smcsmc
, smcsmc.run_smcsmc
. This function takes a single main argument.
Input arguments are always formatted into a dict and require both a key and a value. The key is always the name of the argument, and the value is generally the value of the arguement. In some cases they differ, and it is important to understand when this is the case. All values should be given as strings unless otherwise noted. This is simply a convenience to avoid complicated post processing.
import smcsmc
args = {
'seg': 'test_seg.seg'
'nsam': '4'
}
smcsmc.run_smcsmc(args)
The arguments given entirely determine the smcsmc
inference.
Processing of Arguments¶
There are three main kinds of arguments to smcsmc
.
- Key value pairings: These are the most common arguments, and require both a key and value.
- e.g.
{'chunk': '100'}
- e.g.
- Boolean arguments: To input a boolean, the name of the argument should be given and an empty string passed as its value. This will be picked up in processing and treated correctly.
- e.g.
{'no-infer-recomb': ''}
- e.g.
- Mulitple arguments of the same name: Some arguements to
smcsmc
are passed directly toSCRM
and act as psuedo-ms
code. In this case, pass arguements having the same name as a vector andsmcsmc
will process them into the correct number of identically named arguements.- e.g.
{'eM': ['0.0092 10', '0 5']}
- e.g.
Todo
This is not yet implemented.
smcsmc
will also understand that an argument passed with None
as a value will be removed entirely. This is a convenience for reusing input.
Arguments¶
[*]
arguments are required. [+]
arguments are optional, but one amungst the group is required.
The following are arguments that define properties of the population you are studying. They are fixed and do not change.
Key | Value | Description |
---|---|---|
nsam | n | [*] Set the total number of samples to n |
N0 | N | [+] Set (unscaled) initial population size to N |
mu | s | [+] Set mutation rate (per nucleotide per generation) |
t | th | [+] Set mutation rate (expected number of mutations) for the locus (=4 N0 mu L ) |
length | L | [+] Set locus length (nucleotides) |
I | n s1..sn | Use an n-population island model with si individuals sampled |
ej | t i j | Speciation event at t*4N0; creates population i from population j in forward direction |
eI | t s1..sn | Sample s1..sn indiviuals from their populations at time t*4N0 generations |
The following arguements describe initial values which will be inferred and updated during runtime.
Key | Value | Description |
---|---|---|
rho | rho | [+] Set recombination rate (per nucleotide per generation) |
r | r L | [+] Set initial per locus recombination rate (=4 N0 L rho) and locus length (L) |
eN | t n | Change the size of all populations to n*N0 at time t*4N0 |
en | t i n | Change the size of population i to n*N0 at time t*4N0 |
eM | t m | Change the symmetric backward migration rate to m/(npop-1) at time t*4N0 |
em | t i j m | Change the backward migration rate from population i to population j to m/(npop-1) at time t*4N0 |
ema | t s11 s12 … | Set backward migration rate matrix at time t*4N0 |
The following arguements define inference related options.
Key | value | Description |
---|---|---|
o | f | [*] Output prefix |
seg | f | [+] Input .seg file |
segs | f1 f2 … | [+] Input .seg files (will be merged into a single .seg file |
maxgap | n | Split .seg files over gaps larger than maxgap (200 kb) |
minseg | n | After splitting ignore segments shorter than minseg (500 kb) |
startpos | x | First locus to process (1) |
P | s e p | Divide time interval [s - e] (generations; s>0) equally on log scale using pattern p (e.g. 1*2+8*1) |
Np | n | Number of particles |
seed | s | Random number seed |
calibrate_lag | s | Accumulate inferred events with a lag of s times the survival time (2) |
apf | b | Auxiliary particle filter: none (0) singletons (1) cherries (2) |
dephase | Dephase heterozygous sites (but use phasing for -apf) | |
ancestral_aware | Assume that haplotype 0 is ancestral | |
bias_heights | t0..tn | Set recombination bias times to h0..hn * 4N0 |
bias_strengths | s1..sn | Set recombination bias strenghts |
arg | range | Sample posterior ARG at given epoch or epoch range (0-based closed; e.g. 0-10) |
These arguments define the behaviour of the parameter updates via stochastic EM or Variational Bayes.
Key | Value | Description |
---|---|---|
EM | n | Number of EM (or VB) iterations-1 (0) |
VB | Use Variational Bayes rather than EM (uniform prior for all rates) | |
cap | n | Set (unscaled) upper bound on effective population size |
chunks | n | Number of chunks computed in parallel (1) |
no_infer_recomb | Do not infer recombination rate | |
no_m_step | Do not update parameters (but do infer recombination guide) | |
alpha | t | Fraction of posterior recombination to mix in to recombination guide (0.0); negative removes files |
These are general options.
Overview of Inference Procedure¶
smcsmc
uses a particle and several novel methodological advancements to estimate epoch specific demographic parameters from sequence. Here we outline the general inference procedure, however for those interested, the complete procedure is given in the preprint.
We observe polymorphisms across each of \(n\) sequences and model a continuous-time stochastic process with piecewise constant realisations \(x:[1,L) \rightarrow \mathbb{T}\) where \(\mathbb{T}\) is the state space of the Markov process. In our case, \(\mathbb{T}\) represents the set of genealogical trees over \(n\) sequences. Our goal is to sample from the posterior distribution of genealogical trees.
We extend the notion of a particle filter to Markov jump processes which are continuous in both time (sequence) and space (the uncountable set \(\mathbb{T}\)). In this case, we are able to use Radon-Nikodym derivates as opposed to ratios of probability distributions.
Sampling Importance Resampling¶
Particle filters build up a sample from the posterior sequentially using the notion of Monte Carlo approximation.
Suppose we want to find the expectation of some function \(f\) and some hidden variable \(z\). We would like to draw samples from the posterior distribution \(p(z_n | X_n)\) where \(X_n = (x_1, \dots, x_n)\) are observed. Using Monte Carlo approximation:
where the approximation holds as \(n \rightarrow \infty\). Sampling weights \(\left\{ w^{(l)}_n \right\}\) are defined as
and the set \(\left\{ z^{(l)}_n \right\}\) along with their weights represent the posterior \(p(z_n|x_n)\). However, in many cases the true distribution \(p\) is difficult to sample from, and importance sampling is used. In IS, samples are taken from a so-called “proposal” distribution \(q\) and weights are adjusted accordingly such that
With new weights
And remembering that in our case, the above case would use Radon-Nikodym derivatives as opposed to ratios of probability distributions; for brevity we have illustrated the discrete case.
Using a very similar procecure, we can sample from \(p(z_{s+1} | z_{1:s})\) and adjust weights to approximate \(p(z_{s+1} | x_{s+1})\). See the relevant section in the preprint for more information on this step.
However, because samples from \(q\) generally are not close to the target distribution \(p\), the proportion of particles which are close to the posterior’s mode diminishes exponentially each iteration. To address this, a resampling step draws samples from the proposal distribution, assinging each a weight of \(\frac{1}{N}\). However, in order to avoid unnessarily inflating the variation at the current time, resampling is only performed when the effective sample size (ESS) drops below a certain threshold.
The “waypoints”, or positions at which we consider resampling, require some attention. Too few waypoints will increase the variation of the approximation, while too many will impact computational efficiency without any gains in accuracy. We derive a criteria that avoids particle degeneracy and find that if waypoints are considered at each observation and additional waypoints are added such that waypoints are never more than \(\frac{1}{\sqrt{2 \sigma^2}\) then the ESS will not drop more than \(\sqrt{\frac{1}{e}}\) between waypoints. This avoids degeneracy.
In this way, we sequentially build up an approximation of the posterior.
A Lookahead Likelihood¶
Not every particle is equally important for approximating the posterior. Upweighting particles which will be relevant in the future at the expense of those only relevant to the current time will increase the overall accuracy of the algorithm. We extend Pitt and Shephard’s discrete time Auxiliary Particle Filter to our continuous-time case.
We evaluate the likelihood (though this term is used loosely) of the current state \(x^{(i)}_s\) at some future point which is not too far beyond the current point, though how far ahead will depend on how well this lookahead distribution fits the true distribution.
We modify the above sampling algorithm to incorporate a second set of “resampling weights” which incorporate this lookahead likelihood. When resampling is performed, particles are weighted by this second, informed, weight, to better adapt to future variation.
In practice, we require a tractable approximate likelihood of future data conditioned on a given genealogy. Several simplifications are neccessary, and we primarily incorporate singletons and doubletons which are informative about topology near the tips of the genealogy. We primarily look at the distance \(s_i\) to the nearest future singleton for each sequence, and the mutually consistent cherries with their supporting doubletons. We derive an approximation of the likelihood of the current genealogy given these data.
Parameter Inference¶
Parameters may be inferred either by stochastic expectation maximization (SEM) or via Variational Bayes (VB). SEM maximizes the expected log likelihood over a posterior distribution of latent variables (as generated above) and gives a maximum a posteriori (MAP) estimate of parameter values \(\theta\). VB can substantially improve on SEM by iteratively estimating the posterior distirbution of \(\theta\) rather than taking a point estimate. In practice, this distribution is not reported in the current implementation, however we find that using VB is useful in avoiding fixed-point solutions in SEM.
Clustering Computing¶
smcsmc
uses a lot of compute time. To ammeliorate this, it is highly recommend that users should make use of compute clusters, whether academic or otherwise. In this growing guide, we show how to leverage various compute cluster environments to best use smcsmc
. Please help us to fill in this section if you use a different cluster computing system.
SGE / qsub
¶
The code ships with native support for SGE clusters and usage is relatively straightforward. Procedure for PBS clusters should be almost identical, and only minor tweaking of model.py
will give identical functionality.
Tip
It’s a good idea to run smcsmc
in a tmux
or screen
session on the head node of your compute cluster. This will ensure that if your connection is interupted the program will not stopped.
- If you wish to use a queue other than your default, place a
qsub.conf
file in the same directory that you want to conduct your analysis. - Specify a number of chunks in your input arguements. A good number to start with for typical human sequences is 100.
- Tell
smcsmc
that you wish to spawn cluster jobs by specifying thec
option. Note that this overrides the use of multithreading. If you wish to use specific parameters for your cluster session, theC
flag can be used to provide specific directives and will take priority over all instructions given inqsub.conf
.
That’s it!
As an example, the following would be added to any other input arguements that you wish to use:
cluster_args = {
'c': '',
'chunks': '100',
'C': '-P your.prc -q short.prj'
}
For more information about how input arguements are handled by smcsmc
see
Tutorials¶
Before embarking on any of these tutorials, it is highly recommended that you set up your smcsmc
installation to work alongside a compute cluster, if at all possible.
Simons Global Diversity Panel¶
In this tutorial we will use smcsmc
to analyse the population history of a population from the Simons Global Diversity Panel.
We very arbitrarily chose to analyse the population history of the Mbuti, a South-Central African group with an ancient divergence from the rest of the continent.
Downloading and Converting Data¶
We will only download one chromosome of data for the sake of this tutorial.
First, make a new folder to hold your analysis of the Mbuti:
mkdir smcsmc-mbuti
cd smcsmc-mbuti
We will download a small chromosome from the phased release of the SGDP provided by the Reich Lab and save it as sgdp_phased_chr21.vcf.gz
:
wget https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/phased_data/PS2_multisample_public/cteam_extended.v4.PS2_phase.public.chr21.vcf.gz -O sgdp_phased_chr21.vcf.gz
We then need to convert this sample from vcf
file format to the segments which smcsmc
requires. We can use smcsmc.vcf_to_seg()
.
import smcsmc
smcsmc.vcf_to_seg(
[('sgdp_phased_chr21.vcf.gz', 'S_Mbuti-1')],
output = 'mbuti_chr21.seg.gz',
chroms = [21])
Warning
This can take a few minutes.
Now we have a list of all the segments in the Mbuti individual.
Running the Model¶
We are going to follow a very similar procedure as the Getting Started guide:
import os
smcsmcpath = os.path.expandvars('${CONDA_PREFIX}/bin/smcsmc')
args = {
'EM': '15',
'Np': '10000',
# Submission Parameters
'chunks': '100',
'no_infer_recomb': '',
'smcsmcpath': smcsmcpath,
# Other inference parameters
'mu': '1.25e-9',
'N0': '14312',
'rho': '3e-9',
'calibrate_lag': '1.0',
'tmax': '3.5',
'alpha': '0',
'apf': '2',
'P': '133 133016 31*1',
'VB': '',
'nsam': '2',
# Files
'o': 'run',
'seg': 'mbuti_chr21.seg.gz'
}
If you are able, add in the c
flag to run on a cluster compute system.
Then simply run the model.
smcsmc.run_smcsmc(args)
Todo
Run this model and include a plot of the results.
Simulated Data with SCRM¶
Simulated Data with msprime¶
Another popular framework for coalescent simulation is msprime
, a reimplimentation of Hudson’s ms
. In this tutorial we show how to simulate data with msprime
under any model and reinfer demographic parmeters with smcsmc
.
Data Generation¶
This is not a guide for using msprime
, and if you are unfamiliar with syntax, the documentation is very helpful. Here we are essentially following the analysis implemented in the PopSim analysis comparing multiple tools for demographic inference.
import msprime
from stdpopsim import homo_sapiens
# Here we use a three-population Out of Africa (OoA) model of human history.
#
# It is stored in the stdpopsim homo_sapiens class.
#
# Your model could be anything.
model = getattr(stdpopsim.homo_sapiens, "GutenkunstThreePopOutOfAfrica")
# We are interested in four individuals from the second (European) population.
#
# We take these samples in the present day.
samples = [msprime.Sample(population = 1, time = 0)] * 4
# Perform the simulation
# We pass all of the demographic events from the model as a dictionary to the simulation function.
ts = msprime.simulate(
samples = samples,
mutation_rate = 1.25e-8,
**model.asdict())
# And spit out the tree file.
ts.dump('msprime.tree')
We include functions to convert from tree sequence dumps to the seg
files necessary for smcsmc
analysis.
import smcsmc
smcsmc.ts_to_seg('msprime.tree')
Which will write msprime.seg
in the current working directory. We now, fairly straightforwardly, analysing this seg data using the same procedure outlined in Getting Started.
import smcsmc
args = {
'EM': '15',
'Np': '10000',
# Submission Parameters
'chunks': '100',
'c': '',
'no_infer_recomb': '',
# Other inference parameters
'mu': '1.25e-9',
'N0': '14312',
'rho': '3e-9',
'calibrate_lag': '1.0',
'tmax': '3.5',
'alpha': '0',
'apf': '2',
'P': '133 133016 31*1',
'VB': '',
'nsam': '4',
# Files
'o': 'run',
'seg': 'msprime.seg'
}
smcsmc.run_smcsmc(args)
Which creates the following output result.out
:
Iter Epoch Start End Type From To Opp Count Rate Ne ESS Clump
15 0 0 133 Coal 0 -1 1646116.6 21.30762 1.2944174e-05 38627.416 1.3013 -1
15 1 133 166.2 Coal 0 -1 872269.46 26.817873 3.0744941e-05 16262.838 1.9003 -1
15 2 166.2 207.68 Coal 0 -1 1357402.8 2.1042327 1.5501903e-06 322541.04 2.1558 -1
15 3 207.68 259.53 Coal 0 -1 2109009.6 14.543791 6.8960285e-06 72505.501 2.4425 -1
15 4 259.53 324.31 Coal 0 -1 3278185.9 49.605303 1.5131937e-05 33042.696 2.7563 -1
15 5 324.31 405.26 Coal 0 -1 5071446.2 172.38 3.3990305e-05 14710.077 3.7783 -1
15 6 405.26 506.42 Coal 0 -1 7753527.6 378.89 4.886679e-05 10231.898 3.719 -1
15 7 506.42 632.82 Coal 0 -1 11654873 1049.43 9.0042165e-05 5552.954 4.6282 -1
15 8 632.82 790.79 Coal 0 -1 16851889 2312.22 0.00013720836 3644.0929 5.2067 -1
15 9 790.79 988.18 Coal 0 -1 23398842 3743.37 0.00015998099 3125.3713 6.1686 -1
15 10 988.18 1234.8 Coal 0 -1 31119159 5716.96 0.0001837119 2721.6527 7.2951 -1
15 11 1234.8 1543.1 Coal 0 -1 40005064 7390.63 0.00018474236 2706.4718 7.5514 -1
15 12 1543.1 1928.2 Coal 0 -1 50983716 8280.42 0.00016241303 3078.5707 9.1861 -1
15 13 1928.2 2409.6 Coal 0 -1 65470637 9105.62 0.00013907945 3595.0675 9.8557 -1
15 14 2409.6 3011 Coal 0 -1 85565757 9585.52 0.00011202519 4463.282 11.876 -1
15 15 3011 3762.6 Coal 0 -1 1.1401707e+08 10315.08 9.0469615e-05 5526.7174 15.182 -1
15 16 3762.6 4701.8 Coal 0 -1 1.5524053e+08 10862.25 6.9970451e-05 7145.8736 17.483 -1
15 17 4701.8 5875.5 Coal 0 -1 2.1375026e+08 12649.97 5.9181073e-05 8448.647 20.76 -1
15 18 5875.5 7342.1 Coal 0 -1 2.9163573e+08 16020.13 5.4931986e-05 9102.1649 27.88 -1
15 19 7342.1 9174.7 Coal 0 -1 3.863795e+08 21177.96 5.4811293e-05 9122.2078 34.639 -1
15 20 9174.7 11465 Coal 0 -1 4.88984e+08 28276.39 5.7826821e-05 8646.5068 43.304 -1
15 21 11465 14327 Coal 0 -1 5.8197353e+08 36240.01 6.2270891e-05 8029.4339 51.301 -1
15 22 14327 17903 Coal 0 -1 6.451339e+08 43546.96 6.7500653e-05 7407.3357 60.539 -1
15 23 17903 22372 Coal 0 -1 6.6748478e+08 47326.54 7.0902801e-05 7051.9076 68.849 -1
15 24 22372 27956 Coal 0 -1 6.4691201e+08 46774.72 7.2304609e-05 6915.1885 74.323 -1
15 25 27956 34934 Coal 0 -1 5.8593471e+08 42400.23 7.2363404e-05 6909.5699 85.549 -1
15 26 34934 43654 Coal 0 -1 4.8670448e+08 34939.75 7.1788428e-05 6964.9108 87.582 -1
15 27 43654 54551 Coal 0 -1 3.6252162e+08 25596.63 7.0607181e-05 7081.4327 93.198 -1
15 28 54551 68167 Coal 0 -1 2.3434808e+08 16196.81 6.9114327e-05 7234.39 76.176 -1
15 29 68167 85183 Coal 0 -1 1.2699601e+08 8639.48 6.8029541e-05 7349.7482 72.283 -1
15 30 85183 1.0645e+05 Coal 0 -1 54744833 3688.46 6.7375491e-05 7421.0962 66.976 -1
15 31 1.0645e+05 1.3302e+05 Coal 0 -1 17948123 1184.46 6.5993531e-05 7576.5002 65.159 -1
15 32 1.3302e+05 1e+99 Coal 0 -1 4926174.5 316.93 6.4335926e-05 7771.7076 72.307 -1
15 -1 0 1e+99 Delay -1 -1 3.0934923e+09 0 0 0 1 -1
15 -1 0 1e+99 Recomb -1 -1 3300 9.9000003e-06 3.0000001e-09 0 1 -1
15 -1 0 1e+99 Resamp -1 -1 3.0934923e+09 882448 0.00028525948 0 1 -1
Of course your output will not be the same, but if you have properly set up smcsmc
to use a compute cluster and given it sufficient time to execute, then the resulting trends will be highly similar.
Todo
Combine the functions to produce a plot with a guide so that I can show it here. Currently I have a SCRM plot with guide and an msprime plot from stdpopsim but not both.
Five replications of this leads to the following results:

API Documentation¶
Main Function¶
smcsmc.
run_smcsmc
(args)[source]¶The main entry point to
smcsmc
, this function runs the whole analysis portion from start to finish. The one single argument is a dictionary of arguments as detailed on the Input Arguments.Tip
It’s a really good idea to run
smcsmc
in atmux
session on the login node of your cluster if you are doing large analyses. The main loop takes very little memory, and spawns off cluster jobs if it is configured to do so (Clustering Computing).
Parameters: args (dict) – A dictionary of arguments as per Input Arguments. An entirely equivalent entry point is the command line interface called
smc2
. See Command Line Interface for examples.
smcsmc
requires segment files as input. Below we detail three main ways to create them.
- From VCF files
- From
msprime
sufficient tree sequences- From
SCRM
simulationsIf you are using a different data type and would like help converting it to seg format, please let us know. For more details on converting files and interpreting the output of the algorithm, please see Getting Started.
Format Conversions¶
-
smcsmc.
ts_to_seg
(path, n=None)[source]¶ Converts a tree sequence into a seg file for use by
smcsmc.run_smcsmcs()
. This is especially useful if you are simulating data frommsprime
and would like to directly use it insmcsmc
. For details of how to do this, please see the tutorial on simulation usingmsprime
.Provide the path to the tree sequence, and the suffix will be replaced by
.seg
. This code is adapted from PopSim.Parameters:
-
smcsmc.
vcf_to_seg
(input, output, masks=None, tmpdir='.tmp', key='tmp', chroms=range(1, 23))[source]¶ This function converts data from a VCF to the segment type required for
smcsmc
. You can also (optionally) include masks for the VCF (for example, from 1000 genomes) to indicate callable regions. This function first creates a number of intermediary VCF files intmpdir
, identifiable by theirkey
and sample IDs. The function, by default, will loop over all chromosomes and create seperateseg.gz
files for each of them, however you can specify only a subset of chromosomes by providing a list tochroms
.It is preferable for VCFs to be phased, but it is by no means necessary. Phasing helps to improve the effectiveness of the lookahead likelihood.
Warning
This function does not take a lot of memory, but it can run for quite a while with genome-scale VCFs.
The format of the input argument is as follows:
input = [ ("path_to_vcf", "sample_ID_1"), ("path_to_vcf", "sample_ID_2) ]
For each individual that you would like to include in the segment file, you must specify its identifier and the path to its VCF. This means, in some cases, that you will be repeating the VCF paths. That’s okay. Give the pairing of the VCF path and Sample ID (in that order) as a tuple, and give the list of tuples as the input argument. This is the same order in which individual’s genotypes will appear in the seg file.
Chromosomes
If you have one VCF with many chromsoomes, specify the chromosomes that you want to use in the anlaysis in the
chrom
option, as mentioned above. If you have many VCFs for each chromosome, you will have to run this function separately for each, specifying the correct chromsome each time.Masks
Masks are bed files which indicate the callable regions from a VCF file, and may be included. If you do include masks, please make sure to include at least as many masks as individuals.
Example
If you wanted to convert two individuals from two different VCFs, whilst specifying masks, for chromosomes 5,6, and 7:
smcsmc.vcf_to_seg( input = [ ("/path/to/vcf1", "name_of_individual_1"), ("/path/to/vcf2", "name_of_individual_2")], output = "chrs567.seg.gz", masks = [ "/path/to/mask1.bed.gz", "/path/to/mask2.bed.gz"], key = "chrs567", chroms = [5,6,7] )
Which would create
chr567.seg.gz
in the current working directory.Parameters: - input (list) – List of tuples, where the first element is the path to the VCF file and the second is the individual to be included.
- output (str) – Path to the output segment file. Gzipped if the suffix indicates so.
- masks (list) – List of masks for each of the individuals given in
input
. These are positive masks. Masks are given as bed files, optionally gzipped. - tempdir (str) – Directory to write intermediary vcf files. This is not cleaned after runs, so make sure you know where it is! This is to preserve the files for any further conversions.
- key (str) – Unique identifier of this run.
- chroms (list) – Either a list or range of chromosomes codes to use in this particular run.
Todo
It would be good to have a “cleanup = True” option to get rid of the intermediary files. This would be highly inefficient for rerunning but maybe worth it for some people.
Returns: Nothing.
Plotting¶
-
smcsmc.plot.
plot_migration
(input='result.out', output='result.png', g=30, ymax=0.00025)[source]¶ Plot the migration between two groups.
The function pulls from the input, which must be some sort of aggregated output from
smcsmc
, and plots the migration rate over time for a given generational timeg
, which scales the number of generations to years. The Y axis (which can be modified with theymax
parameter) represents \(m_{ij}\), or the proportion of population j being placed by migration from population i backwards in time. This is important, as most intution about migration is understood forward in time.This is a barebones function and there is no option to provide a “truth” bar. See the other plotting functions for various definitions of “truth”.
As an example, here is the resulting (symmetric) migration from a simulation:
Todo
Typo above, it is not log years, rather years and the axis is logged.
Parameters: - input (str) – This must be the file path to some sort of aggregated output from
smcsmc
. This means that it can be either achunkfinal.out
orresult.out
but not an individual chunk. We need all the data here. - output (str) – Filepath to save the plot.
- g (int) – The length of one generation in years. This is used to scale the x axis.
- ymax (float) – The maximum y value to plot. This is used to scale the plots up or down.
- input (str) – This must be the file path to some sort of aggregated output from
-
smcsmc.plot.
plot_rainbow
(input, output, g=30, model=None, steps=None, pop_id=1)[source]¶ Creates a plot of all iterations by colour. This plot is useful for assessing convergence.
Parameters: - input (str) – The full file path to an aggregated output file from
smcsmc
. - output (str) – Filepath to save the plot.
- g (int) – The length of one generation in years. This is used to scale the x axis.
- model (stdpopsim.Model) – Model for plotting.
- ymax (float) – The maximum y value to plot. This is used to scale the plots up or down.
- steps (int) – Don’t worry about this.
- pop_id (int) – If your model includes multiple populations, which one do you want to plot?
- input (str) – The full file path to an aggregated output file from
-
smcsmc.plot.
plot_with_guide
(input, guide, output, g=30, ymax=0.00025, N0=14312)[source]¶ This function is very similar to
smcsmc.plot.plot_migration()
except that it includes the ability to add a bar of “truth”. In this case, the function uses a specific form of “truth” generated from recording all epochs output bySCRM
. Additionally, we provide both the effective population size and the migration rates.The structure of the truth guide is like so:
Start Time Pop_1 Ne Pop_2 Ne Pop_1 M Pop_2 M 0 3 3 6 2 100 0.65 0.3 14 0 … . . . . 10000 0.4 0.4 8 6 Saved as a CSV file for the
guide
argument.Todo
This function is part of a WIP tutorial on simulating with
SCRM
. More details and convenience functions to come.Parameters: - input (str) – The full file path to an aggregated output file from
smcsmc
. - guide (str) – The full file path to a CSV formated as above with the truth of a simulation.
- output (str) – Filepath to save the plot.
- g (int) – The length of one generation in years. This is used to scale the x axis.
- ymax (float) – The maximum y value to plot. This is used to scale the plots up or down.
- input (str) – The full file path to an aggregated output file from
Command Line Interface¶
An entirely equivalent command line interface (CLI), called smc2
, may be used in place of the python
API. It is installed into the conda
bin (and alternatively given in the root of the repository).
For a full listing of arguments, type smc2 -h
into a terminal, or alternatively see Input Arguments.
Glossary¶
- Demographic Parameters
- Effective population size and migration rates discretised over a given number of epochs. Within each of these epochs, the rates are assumed to be constant.
- Rainbow Plot
- A plot used to assess the convergence of EM. The demographic parameters inferred from each epoch are plotted in a seperate colour to visually determine if significant improvement could be made by letting the model run more iterations.
- Particle
- The basic unit of a particle filter, in our case a particle represents an ancestral recombination graph simulated by
SCRM
. Particles are simulated, and updated along the sequence by genetic information that they encounter. Particles are resampled once the effective sample size of the population reaches a given threshold and are weighted according to their approximate likelihood. Asymtotically, this form of resampling importance sampling approaches the true posterior. - Particle Filter
- Particle filters, or Sequential Monte Carlo (SMC) methods, are a class of algorithms which use sampling importance resampling (SIR) to generate samples from the posterior distribution of a latent variable. In our case, this is the set of trees along the sequence. The posterior is approximated using weighted random samples denoted as “particles” drawn from a known, tractable distribution. See Tulsyan, Gopaluni, and Khare 2016 for a very readable review of the basic principles behind particle filters for inference.
- Sequentially Markovian Coalescent
- This may refer to one of a number of variations on McVean and Cardin 2005. The authors derive an approximation of the coalescent with recombination in which lineages with no overlapping ancestral material may not coalesce. This approximation leads to a Markovian state space of genealogies along the sequence, and allows for tractable inference.
- Variational Bayes
- A method used to infer the posterior distribution of demographic parameters from a set of particles. Variation Bayes is quite similar to stochastic Expectation Maximation, except that it generates a full posterior distribution of parameters rather than a single most probable value.
Changelog¶
- 1.0.1: WIP
- Introduced a function to convert from VCF to
seg
file formats and two tests to ensure that it is working. - Significantly improved the API documentation.
- Introduced a function to convert from VCF to
- 1.0: June 18, 2019
- Initial release!