Welcome to the documentation for SMCSMC!

This is the reference manual for SMCSMC (or smcsmc), a method and program for inferring ancestral population size and migration from whole genome sequences.

There are two versions of this documentation. The stable version describes functionality of the most recent conda release, while the latest version documents new features in the development branch.

Introduction

SMCSMC, short for Sequential Monte Carlo for the Sequentially Markovian Coalescent, is a method for estimating ancestral population size and migration history from sequence data. It has several advantages over comparable methods, especially when you are interested in analysing complex demographic models.

The method uses a particle filter to sample from the posterior distribution of trees along the sequence and Variational Bayes to infer epoch specific demographic parameters over a given number of iterations.

smcsmc takes as input optionally phased sequencing data formatted as segments, and provides utilities for analysing and visualising the inferred ancestral rates.

Installation

We highly recommend installing SMCSMC from conda, as it comes packaged with all necessary dependencies. A seperate guide for manual compilation may be found in the developer reference. See here for a helpful guide to installing and using conda to manage programs.

First add both conda-forge and terhorst to your channel lists (if they are not there already), then install smcsmc.

conda config --add channels conda-forge
conda config --add channels terhorst

conda install -c luntergroup smcsmc

Basic Usage

smcsmc can be run on the command line with smc2 or through the python API.

smc2 -h

Follow the Getting Started guide to become familiar with the basic structure and function of SMCSMC commands, then look at one of the tutorials for analysing simulated or real data. For a more complete guide to arguments, see Input Arguments.

Other Methods

SMCSMC is part of the PopSim consortium, and we are actively involved in building a framework to standardize population genetic analyses. Part of this involves making it easy to run the same analysis with many different methods. We have built smcsmc with this goal in mind. For the latest information about comparisons between different population genetic software, including smc++, stairwayplot, msmc, and dadi/fastcoal, check out the PopSim analysis repository.

_images/popsim.png

Population history of a European-acting individual inferred from five replicates of the stdpopsim.homo_sapiens.GutenkunstThreePopOutOfAfrica model of human history.

Citation

If you use smcsmc in your work, please cite the following article:

  1. Henderson, D., Zhu, S. (Joe), & Lunter, G. (2018). Demographic inference using particle filters for continuous Markov jump processes. BioRxiv, 382218. https://doi.org/10.1101/382218
  2. Staab, P. R., Zhu, S., Metzler, D., & Lunter, G. (2015). scrm: efficiently simulating long sequences using the approximated coalescent with recombination. Bioinformatics, 31(10), 1680–1682. https://doi.org/10.1093/bioinformatics/btu861

Getting Started

This guide will familiarise your with the concepts necessary to run analyses in SMCSMC and take you through the basic aspects of an analysis using toy data. You may also be interested in one of several examples using real and simulated data in the Tutorials section.

Note

This tutorial is intended to be run on a personal computer, and values have been scaled accordingly. For analysis on real data, we expect that a user will have access to a compute cluster, and we provide guidance about setting up smcsmc to function with common architechtures in Clustering Computing.

Basic Concepts

SMCSMC is a particle filter, which means that a given number of particles are simulated, evaluated for an approximate likelihood, and resampled until demographic parameters have converged. Many of the options relate to the behaviour of this particle filter, and the remainder deal with the demographic model that you wish to infer. SMCSMC can infer complex demographic models involving several populations, though you should be weary of overspecifying your model. Run time increases drastically with the number of parameters needed to specify the model.

Input Format

Suppose we have the following data, saved into a file named toy.seg. You can also download it here.

segment start segment length invariant break chr genotype
1 521 T F 1 01.0
522 2721 T F 1 0111
3243 1758 T F 1 10//
5001 1296 T F 1 0000
6297 1 T F 1 ….
6298 4669 T F 1 0110
10967 880 T T 1 0100
1 708 T F 2 1010

This gives the state of four haplotypes along the sequence, indicating segment starts and their lengths. If a segment is the last in a block before coordinates reset (i.e. when including multiple chromosomes in a single seg file, it is called a genetic break. Genotypes in terms of either major/minor or ancestral/derived are given for each haplotype. Missing data is coded with a period (.) whilst unphased variants are given with a forward slash (/).

Note

This data format is similar to the input for msmc however, in the second column we require the length from the current segment to the next, rather than the number of called sites from the previous segment to the current one.

Segment files are specified by the seg flag, or if more than one are used, by the segs flag. Globs are encouraged for readability. In the case of multiple files, genetic breaks are inferred from the beginnings and ends of files. Additionally, we tell smcsmc the number of samples we expect with nsam, so it knows to check the seg file for formatting issues.

Basic Arguements

In addition to the segment file, we need a few more pieces of information to kick off the particle filter.

Particle Count: We must specify a particle count (Np) or use the default value of 1000. The number of particles refers to the number of individual ARGs which are simulated by SCRM, higher numbers means that the model is more likely to converge on a reasonable answer, but increasing the particle count is computationally expensive. We recommend particle counts between 25 and 50 thousand for analysis. However, for exploratory work, smaller particle counts (a good starting point is 10 thousand) can be used to more effectively use computational resources. In this guide, intended for use on a personal computer, we will use 10 particles.

Number of Iterations: We also explictly define the number of expectation maximization iterations that we want to use (EM). A good place to start is 15 iterations. smcsmc looks for output before starting, so should you be unsatisfied with the convergence after 15 iterations, simply run the exact same program with a higher number of iterations. smcsmc will find the output for the iterations already run and simply continue from where it left off.

Demographic Parameters: Several arguements are good to specify, especially when analysing real data as they help with convergence. Here we will use a mutation rate (mu) of 1.25e-8 and a recombination rate (rho) of 3e-9. We give an effective population size (N0) of 14312, and infer trees back to 4*N0 generations with tmax. We set this to 3.5, which is 1.2 Mya with an N0 of 10000. smcsmc infers demographic parameters as discrete over intervals. We specify 31 intervals evenly spaced on the log scale with P 133 133032 31*1 giving the end times of the first and last epochs in generations, and the pattern for their generation.

We also give the path to the output folder with o.

Running smcsmc

Tip

It is always a good idea to start a new conda environment for a new analysis to ensure that there are no dependecy conflicts, though you can skip this step if you certain that your environment is correctly configured.

conda create --name smcsmc_tutorial
conda activate smcsmc_tutorial

Once smcsmc is installed, we can format the arguements detailed above into a dictionary.

import os

args = {
   'seg':                  'toy.seg',
   'nsam':                 '4',
   'Np':                   '10',
   'EM':                   '1',
   'mu':                   '1.25e-8',
   'rho':                  '3e-9',
   'N0':                   '10000',
   'tmax':                 '3.5',
   'P':                    '133 133032 31*1',
   'no_infer_recomb':      '',
   'smcsmcpath':           os.path.expandvars('${CONDA_PREFIX}/bin/smcsmc'),
   # The minimum segment length is usually best left
   # as the default, however for this tutorial we set it
   # artifically low.
   'minseg':               '10',
   'o':                    'smcsmc_output'
}

We directly use this dictionary with the run_smcsmc command, which takes as its only arguement a dictionary of arguements.

import smcsmc
smcsmc.run_smcsmc(args)

If your installation has been successful, then this will begin the process of parsing the input, merging any given seg files, starting the particle filter, and iterating through the EM steps requested.

Output

If your smcsmc has run correctly, the resulting output directory will look something like this, with a seperate folder for each EM iteration, and a seperate file for each chunk, if this option has been used.

output/
        emiterN/
                chunkN.out
                chunkN.stdout
                chunk.stderr
                chunkfinal.out
        merged.seg
        merged.map
        result.log
        result.out

If you are following this tutorial and are only using a single input seg file, you will not see merged.seg or merged.map as there was no need to generate them. The output for each chunk is given, along with stdout and sterr, and results are aggregated over all chunks each epoch into chunkfinal.out. The final epoch will be post processed into result.out. Output and debugging information along with useful information to help interpret the results of your model are given in result.log.

An example of a results.out file is given here:

Iter  Epoch       Start       End   Type   From     To            Opp          Count           Rate             Ne         ESS  Clump
15      0           0         133   Coal      0     -1      307466.77      18.798649  6.1140424e-05      8177.8955      1.0272     -1
15      0           0         133   Coal      1     -1      324087.66      55.469939  0.00017115721      2921.2909      1.0225     -1
15      1         133       166.2   Coal      0     -1       167571.8      1.0002328  5.9689802e-06      83766.402      1.0791     -1
15      1         133       166.2   Coal      1     -1      174231.23    0.067992116  3.9024069e-07      1281260.5      1.0674     -1
15      2       166.2      207.68   Coal      0     -1      259801.05      40.479172  0.00015580835      3209.0707      1.1686     -1
15      2       166.2      207.68   Coal      1     -1      267803.96  0.00028150086  1.0511452e-09  4.7567166e+08      1.1522     -1
15     -1           0       1e+99  Delay     -1     -1  2.6001037e+09              0              0              0           1     -1
15     -1           0       1e+99   LogL     -1     -1              1      -31005174      -31005174              0           1     -1
15      0           0         133   Migr      0      1       178112.7      7.1251662  4.0003695e-05              0      1.0385     -1
15      0           0         133   Migr      1      0      186343.02  0.00022362676  1.2000812e-09              0      1.0338     -1
15      1         133       166.2   Migr      0      1       87712.37      0.5934193  6.7655144e-06              0      1.0819     -1
15      1         133       166.2   Migr      1      0       91253.03      3.0913275  3.3876437e-05              0      1.0713     -1
15      2       166.2      207.68   Migr      0      1      134755.98      1.0135379  7.5212833e-06              0      1.1711     -1
15      2       166.2      207.68   Migr      1      0      139267.34      6.0189739  4.3218847e-05              0      1.1565     -1

The full interpretation of the output is somewhat involved, and we recommend referring to the supporting publication for a complete explaination of all relevant quantities. However, in brief:

  • Iter: The EM iteration to which this file refers. In the case of aggregated files (chunkfinal.out, result.out), multiple iterations are represented in the same file, and it is often useful either to graph each iteration seperately to show convergence, or to select the last iteration to show the final inference.
  • Epoch: The discrete piece of time in which rates are assumed to be constant. The pattern which epochs follow is set by the P flag.
  • Start: The beginning of the epoch, in generations.
  • End: The end of the poch, in generations.
  • Type: Either Coal (coalescence), Delay, LogL (Log likelihood), or Migr. The important lines for interpreting output are:
    • Coal: This line refers to coalescent events. For Coal events, the From refers to the population of interest while the To is meaningless. An estimate of Ne is shown only in Coal lines.
    • Migr: This line refers to migration events. From and To refer to the source and the sink of migration, backwards in time.
  • Opp: The possible time where events of this can may happen in this sequence. For recombination inference, this refers to the amount of time multiplied by the amount of sequence.
  • Count: The number of events of this type inferred in this epoch.
  • Rate: Either the coalescent, recombination, or migration rate. This is equal to the Count divided by Opp. Coalescent rates are number of coalescentn events per generation. Migration rates represent proportion migration from the source (From) to the sink (To) backwards in time.

Note

SCRM reports migration rates forwards in time while smcsmc reports it backwards in time.

  • Ne: The estimated population size in this epoch. Defined to be Opp divded by 2*Count.
  • ESS: Effective sample size, which is an indicator of the diversity of our particles. This is used when deciding to resample, or when to spawn new particles.
  • Clump: This is defined in aggregated output, and not defined in intermediary files.

Important diagnostic information about the reliability of inferred rates is contained in the results file. For instance, it is often useful to look at the amount of events inferred for each epoch. In general, it should monotonically increase, though this will depend on the definition of your epochs. Can you spot the oddity above?

Very low counts in a particular epoch can be due to any number of reasons, but a good first step is to increase the number of particles.

Visualising Output

We provide several functions for visualising different aspects of the output.

It is a good idea to first check the convergence of your model by plotting each iteration seperately in what we refer to as a rainbow plot. Specify the path to your output result.out file and the path to a preferred file path to store the image.

result_path = 'output/result.out'
plot_path = 'output/rainbow.png'

smcsmc.plot_migration(result_path, plot_path)

An example plot is shown below, note that we have additionally specified a model from which this data was simulated. For details, please refer to the tutorials.

_images/rainbow.png

Estimated population size over 15 EM iterations each plotted in a different colour. The model appears to have well converged to a solution.

Plotting migration and effective population size can be done similarly with smcsmc.plot_migration and smcsmc.plot_ne. See the API documentation for more information.

General Usage

Explore this section to learn more about how to use various aspects of smcsmc.

Input Arguments

This section describes the general structure of smcsmc arguments and provides information on how they are interpreted.

General Structure

The user has a single entry point into smcsmc, smcsmc.run_smcsmc. This function takes a single main argument.

Input arguments are always formatted into a dict and require both a key and a value. The key is always the name of the argument, and the value is generally the value of the arguement. In some cases they differ, and it is important to understand when this is the case. All values should be given as strings unless otherwise noted. This is simply a convenience to avoid complicated post processing.

import smcsmc

args = {
   'seg':    'test_seg.seg'
   'nsam':   '4'

}

smcsmc.run_smcsmc(args)

The arguments given entirely determine the smcsmc inference.

Processing of Arguments

There are three main kinds of arguments to smcsmc.

  • Key value pairings: These are the most common arguments, and require both a key and value.
    • e.g. {'chunk': '100'}
  • Boolean arguments: To input a boolean, the name of the argument should be given and an empty string passed as its value. This will be picked up in processing and treated correctly.
    • e.g. {'no-infer-recomb': ''}
  • Mulitple arguments of the same name: Some arguements to smcsmc are passed directly to SCRM and act as psuedo-ms code. In this case, pass arguements having the same name as a vector and smcsmc will process them into the correct number of identically named arguements.
    • e.g. {'eM': ['0.0092 10', '0 5']}

Todo

This is not yet implemented.

smcsmc will also understand that an argument passed with None as a value will be removed entirely. This is a convenience for reusing input.

Arguments

[*] arguments are required. [+] arguments are optional, but one amungst the group is required.

The following are arguments that define properties of the population you are studying. They are fixed and do not change.

Key Value Description
nsam n [*] Set the total number of samples to n
N0 N [+] Set (unscaled) initial population size to N
mu s [+] Set mutation rate (per nucleotide per generation)
t th [+] Set mutation rate (expected number of mutations) for the locus (=4 N0 mu L )
length L [+] Set locus length (nucleotides)
I n s1..sn Use an n-population island model with si individuals sampled
ej t i j Speciation event at t*4N0; creates population i from population j in forward direction
eI t s1..sn Sample s1..sn indiviuals from their populations at time t*4N0 generations

The following arguements describe initial values which will be inferred and updated during runtime.

Key Value Description
rho rho [+] Set recombination rate (per nucleotide per generation)
r r L [+] Set initial per locus recombination rate (=4 N0 L rho) and locus length (L)
eN t n Change the size of all populations to n*N0 at time t*4N0
en t i n Change the size of population i to n*N0 at time t*4N0
eM t m Change the symmetric backward migration rate to m/(npop-1) at time t*4N0
em t i j m Change the backward migration rate from population i to population j to m/(npop-1) at time t*4N0
ema t s11 s12 … Set backward migration rate matrix at time t*4N0

The following arguements define inference related options.

Key value Description
o f [*] Output prefix
seg f [+] Input .seg file
segs f1 f2 … [+] Input .seg files (will be merged into a single .seg file
maxgap n Split .seg files over gaps larger than maxgap (200 kb)
minseg n After splitting ignore segments shorter than minseg (500 kb)
startpos x First locus to process (1)
P s e p Divide time interval [s - e] (generations; s>0) equally on log scale using pattern p (e.g. 1*2+8*1)
Np n Number of particles
seed s Random number seed
calibrate_lag s Accumulate inferred events with a lag of s times the survival time (2)
apf b Auxiliary particle filter: none (0) singletons (1) cherries (2)
dephase   Dephase heterozygous sites (but use phasing for -apf)
ancestral_aware   Assume that haplotype 0 is ancestral
bias_heights t0..tn Set recombination bias times to h0..hn * 4N0
bias_strengths s1..sn Set recombination bias strenghts
arg range Sample posterior ARG at given epoch or epoch range (0-based closed; e.g. 0-10)

These arguments define the behaviour of the parameter updates via stochastic EM or Variational Bayes.

Key Value Description
EM n Number of EM (or VB) iterations-1 (0)
VB   Use Variational Bayes rather than EM (uniform prior for all rates)
cap n Set (unscaled) upper bound on effective population size
chunks n Number of chunks computed in parallel (1)
no_infer_recomb   Do not infer recombination rate
no_m_step   Do not update parameters (but do infer recombination guide)
alpha t Fraction of posterior recombination to mix in to recombination guide (0.0); negative removes files

These are general options.

Overview of Inference Procedure

smcsmc uses a particle and several novel methodological advancements to estimate epoch specific demographic parameters from sequence. Here we outline the general inference procedure, however for those interested, the complete procedure is given in the preprint.

We observe polymorphisms across each of \(n\) sequences and model a continuous-time stochastic process with piecewise constant realisations \(x:[1,L) \rightarrow \mathbb{T}\) where \(\mathbb{T}\) is the state space of the Markov process. In our case, \(\mathbb{T}\) represents the set of genealogical trees over \(n\) sequences. Our goal is to sample from the posterior distribution of genealogical trees.

We extend the notion of a particle filter to Markov jump processes which are continuous in both time (sequence) and space (the uncountable set \(\mathbb{T}\)). In this case, we are able to use Radon-Nikodym derivates as opposed to ratios of probability distributions.

Sampling Importance Resampling

Particle filters build up a sample from the posterior sequentially using the notion of Monte Carlo approximation.

Suppose we want to find the expectation of some function \(f\) and some hidden variable \(z\). We would like to draw samples from the posterior distribution \(p(z_n | X_n)\) where \(X_n = (x_1, \dots, x_n)\) are observed. Using Monte Carlo approximation:

\[\begin{split}\mathbb{E}_{\color{red}{p}} \left[ f \left( z_n \right) \right] &= \int f(z_n) p \left( z_n | X_n \right) dz_n \\ &\approx \sum^L_{l=1} w^{(l)}_n f \left( z^{(l)}_n \right)\end{split}\]

where the approximation holds as \(n \rightarrow \infty\). Sampling weights \(\left\{ w^{(l)}_n \right\}\) are defined as

\[w^{(l)}_n = \frac{p \left(X_n | z^{(l)}_n \right)}{ \sum^L_{m=1} p \left( x_n | z^{(m)}_n \right)}\]

and the set \(\left\{ z^{(l)}_n \right\}\) along with their weights represent the posterior \(p(z_n|x_n)\). However, in many cases the true distribution \(p\) is difficult to sample from, and importance sampling is used. In IS, samples are taken from a so-called “proposal” distribution \(q\) and weights are adjusted accordingly such that

\[\begin{split}\mathbb{E}_{\color{red}{q}} \left[ f (z_n) \right] &= \int f(z_n) \frac{p(z_n | X_n) }{ q(z_n | X_n)} q(z_n | X_n) dz_n \\ &\approx \sum^L_{l=1} \tilde{w}^{(l)}_n f \left( z^{(l)}_n \right)\end{split}\]

With new weights

\[\tilde{w}^{(l)}_n = w^{(l)}_n \frac{p(z_n | X_n) }{ q(z_n | X_n)}\]

And remembering that in our case, the above case would use Radon-Nikodym derivatives as opposed to ratios of probability distributions; for brevity we have illustrated the discrete case.

Using a very similar procecure, we can sample from \(p(z_{s+1} | z_{1:s})\) and adjust weights to approximate \(p(z_{s+1} | x_{s+1})\). See the relevant section in the preprint for more information on this step.

However, because samples from \(q\) generally are not close to the target distribution \(p\), the proportion of particles which are close to the posterior’s mode diminishes exponentially each iteration. To address this, a resampling step draws samples from the proposal distribution, assinging each a weight of \(\frac{1}{N}\). However, in order to avoid unnessarily inflating the variation at the current time, resampling is only performed when the effective sample size (ESS) drops below a certain threshold.

The “waypoints”, or positions at which we consider resampling, require some attention. Too few waypoints will increase the variation of the approximation, while too many will impact computational efficiency without any gains in accuracy. We derive a criteria that avoids particle degeneracy and find that if waypoints are considered at each observation and additional waypoints are added such that waypoints are never more than \(\frac{1}{\sqrt{2 \sigma^2}\) then the ESS will not drop more than \(\sqrt{\frac{1}{e}}\) between waypoints. This avoids degeneracy.

In this way, we sequentially build up an approximation of the posterior.

A Lookahead Likelihood

Not every particle is equally important for approximating the posterior. Upweighting particles which will be relevant in the future at the expense of those only relevant to the current time will increase the overall accuracy of the algorithm. We extend Pitt and Shephard’s discrete time Auxiliary Particle Filter to our continuous-time case.

We evaluate the likelihood (though this term is used loosely) of the current state \(x^{(i)}_s\) at some future point which is not too far beyond the current point, though how far ahead will depend on how well this lookahead distribution fits the true distribution.

We modify the above sampling algorithm to incorporate a second set of “resampling weights” which incorporate this lookahead likelihood. When resampling is performed, particles are weighted by this second, informed, weight, to better adapt to future variation.

In practice, we require a tractable approximate likelihood of future data conditioned on a given genealogy. Several simplifications are neccessary, and we primarily incorporate singletons and doubletons which are informative about topology near the tips of the genealogy. We primarily look at the distance \(s_i\) to the nearest future singleton for each sequence, and the mutually consistent cherries with their supporting doubletons. We derive an approximation of the likelihood of the current genealogy given these data.

Parameter Inference

Parameters may be inferred either by stochastic expectation maximization (SEM) or via Variational Bayes (VB). SEM maximizes the expected log likelihood over a posterior distribution of latent variables (as generated above) and gives a maximum a posteriori (MAP) estimate of parameter values \(\theta\). VB can substantially improve on SEM by iteratively estimating the posterior distirbution of \(\theta\) rather than taking a point estimate. In practice, this distribution is not reported in the current implementation, however we find that using VB is useful in avoiding fixed-point solutions in SEM.

Clustering Computing

smcsmc uses a lot of compute time. To ammeliorate this, it is highly recommend that users should make use of compute clusters, whether academic or otherwise. In this growing guide, we show how to leverage various compute cluster environments to best use smcsmc. Please help us to fill in this section if you use a different cluster computing system.

SGE / qsub

The code ships with native support for SGE clusters and usage is relatively straightforward. Procedure for PBS clusters should be almost identical, and only minor tweaking of model.py will give identical functionality.

Tip

It’s a good idea to run smcsmc in a tmux or screen session on the head node of your compute cluster. This will ensure that if your connection is interupted the program will not stopped.

  1. If you wish to use a queue other than your default, place a qsub.conf file in the same directory that you want to conduct your analysis.
  2. Specify a number of chunks in your input arguements. A good number to start with for typical human sequences is 100.
  3. Tell smcsmc that you wish to spawn cluster jobs by specifying the c option. Note that this overrides the use of multithreading. If you wish to use specific parameters for your cluster session, the C flag can be used to provide specific directives and will take priority over all instructions given in qsub.conf.

That’s it!

As an example, the following would be added to any other input arguements that you wish to use:

cluster_args = {
   'c':      '',
   'chunks': '100',
   'C':      '-P your.prc -q short.prj'
}

For more information about how input arguements are handled by smcsmc see

Tutorials

Before embarking on any of these tutorials, it is highly recommended that you set up your smcsmc installation to work alongside a compute cluster, if at all possible.

Simons Global Diversity Panel

In this tutorial we will use smcsmc to analyse the population history of a population from the Simons Global Diversity Panel.

We very arbitrarily chose to analyse the population history of the Mbuti, a South-Central African group with an ancient divergence from the rest of the continent.

Downloading and Converting Data

We will only download one chromosome of data for the sake of this tutorial.

First, make a new folder to hold your analysis of the Mbuti:

mkdir smcsmc-mbuti
cd smcsmc-mbuti

We will download a small chromosome from the phased release of the SGDP provided by the Reich Lab and save it as sgdp_phased_chr21.vcf.gz:

wget https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/phased_data/PS2_multisample_public/cteam_extended.v4.PS2_phase.public.chr21.vcf.gz -O sgdp_phased_chr21.vcf.gz

We then need to convert this sample from vcf file format to the segments which smcsmc requires. We can use smcsmc.vcf_to_seg().

import smcsmc

smcsmc.vcf_to_seg(
        [('sgdp_phased_chr21.vcf.gz', 'S_Mbuti-1')],
        output = 'mbuti_chr21.seg.gz',
        chroms = [21])

Warning

This can take a few minutes.

Now we have a list of all the segments in the Mbuti individual.

Running the Model

We are going to follow a very similar procedure as the Getting Started guide:

import os

smcsmcpath = os.path.expandvars('${CONDA_PREFIX}/bin/smcsmc')

args = {
        'EM':                           '15',
        'Np':                           '10000',

        # Submission Parameters
        'chunks':                       '100',
        'no_infer_recomb':              '',

        'smcsmcpath':                   smcsmcpath,

        # Other inference parameters
        'mu':                           '1.25e-9',
        'N0':                           '14312',
        'rho':                          '3e-9',
        'calibrate_lag':                '1.0',
        'tmax':                         '3.5',
        'alpha':                        '0',
        'apf':                          '2',
        'P':                            '133 133016 31*1',
        'VB':                           '',
        'nsam':                         '2',

        # Files
        'o':                            'run',
        'seg':                          'mbuti_chr21.seg.gz'
}

If you are able, add in the c flag to run on a cluster compute system.

Then simply run the model.

smcsmc.run_smcsmc(args)

Todo

Run this model and include a plot of the results.

Simulated Data with SCRM

Simulated Data with msprime

Another popular framework for coalescent simulation is msprime, a reimplimentation of Hudson’s ms. In this tutorial we show how to simulate data with msprime under any model and reinfer demographic parmeters with smcsmc.

Data Generation

This is not a guide for using msprime, and if you are unfamiliar with syntax, the documentation is very helpful. Here we are essentially following the analysis implemented in the PopSim analysis comparing multiple tools for demographic inference.

import msprime
from stdpopsim import homo_sapiens

# Here we use a three-population Out of Africa (OoA) model of human history.
#
# It is stored in the stdpopsim homo_sapiens class.
#
# Your model could be anything.
model = getattr(stdpopsim.homo_sapiens, "GutenkunstThreePopOutOfAfrica")

# We are interested in four individuals from the second (European) population.
#
# We take these samples in the present day.
samples = [msprime.Sample(population = 1, time = 0)] * 4

# Perform the simulation
# We pass all of the demographic events from the model as a dictionary to the simulation function.
ts = msprime.simulate(
        samples = samples,
        mutation_rate = 1.25e-8,
        **model.asdict())

# And spit out the tree file.
ts.dump('msprime.tree')

We include functions to convert from tree sequence dumps to the seg files necessary for smcsmc analysis.

import smcsmc

smcsmc.ts_to_seg('msprime.tree')

Which will write msprime.seg in the current working directory. We now, fairly straightforwardly, analysing this seg data using the same procedure outlined in Getting Started.

import smcsmc

args = {
        'EM':                           '15',
        'Np':                           '10000',

        # Submission Parameters
        'chunks':                       '100',
        'c':                            '',
        'no_infer_recomb':              '',

        # Other inference parameters
        'mu':                           '1.25e-9',
        'N0':                           '14312',
        'rho':                          '3e-9',
        'calibrate_lag':                '1.0',
        'tmax':                         '3.5',
        'alpha':                        '0',
        'apf':                          '2',
        'P':                            '133 133016 31*1',
        'VB':                           '',
        'nsam':                         '4',

        # Files
        'o':                            'run',
        'seg':                          'msprime.seg'
}

smcsmc.run_smcsmc(args)

Which creates the following output result.out:

Iter  Epoch       Start         End   Type   From     To            Opp          Count           Rate             Ne         ESS  Clump
15      0           0         133   Coal      0     -1      1646116.6       21.30762  1.2944174e-05      38627.416      1.3013     -1
15      1         133       166.2   Coal      0     -1      872269.46      26.817873  3.0744941e-05      16262.838      1.9003     -1
15      2       166.2      207.68   Coal      0     -1      1357402.8      2.1042327  1.5501903e-06      322541.04      2.1558     -1
15      3      207.68      259.53   Coal      0     -1      2109009.6      14.543791  6.8960285e-06      72505.501      2.4425     -1
15      4      259.53      324.31   Coal      0     -1      3278185.9      49.605303  1.5131937e-05      33042.696      2.7563     -1
15      5      324.31      405.26   Coal      0     -1      5071446.2         172.38  3.3990305e-05      14710.077      3.7783     -1
15      6      405.26      506.42   Coal      0     -1      7753527.6         378.89   4.886679e-05      10231.898       3.719     -1
15      7      506.42      632.82   Coal      0     -1       11654873        1049.43  9.0042165e-05       5552.954      4.6282     -1
15      8      632.82      790.79   Coal      0     -1       16851889        2312.22  0.00013720836      3644.0929      5.2067     -1
15      9      790.79      988.18   Coal      0     -1       23398842        3743.37  0.00015998099      3125.3713      6.1686     -1
15     10      988.18      1234.8   Coal      0     -1       31119159        5716.96   0.0001837119      2721.6527      7.2951     -1
15     11      1234.8      1543.1   Coal      0     -1       40005064        7390.63  0.00018474236      2706.4718      7.5514     -1
15     12      1543.1      1928.2   Coal      0     -1       50983716        8280.42  0.00016241303      3078.5707      9.1861     -1
15     13      1928.2      2409.6   Coal      0     -1       65470637        9105.62  0.00013907945      3595.0675      9.8557     -1
15     14      2409.6        3011   Coal      0     -1       85565757        9585.52  0.00011202519       4463.282      11.876     -1
15     15        3011      3762.6   Coal      0     -1  1.1401707e+08       10315.08  9.0469615e-05      5526.7174      15.182     -1
15     16      3762.6      4701.8   Coal      0     -1  1.5524053e+08       10862.25  6.9970451e-05      7145.8736      17.483     -1
15     17      4701.8      5875.5   Coal      0     -1  2.1375026e+08       12649.97  5.9181073e-05       8448.647       20.76     -1
15     18      5875.5      7342.1   Coal      0     -1  2.9163573e+08       16020.13  5.4931986e-05      9102.1649       27.88     -1
15     19      7342.1      9174.7   Coal      0     -1   3.863795e+08       21177.96  5.4811293e-05      9122.2078      34.639     -1
15     20      9174.7       11465   Coal      0     -1    4.88984e+08       28276.39  5.7826821e-05      8646.5068      43.304     -1
15     21       11465       14327   Coal      0     -1  5.8197353e+08       36240.01  6.2270891e-05      8029.4339      51.301     -1
15     22       14327       17903   Coal      0     -1   6.451339e+08       43546.96  6.7500653e-05      7407.3357      60.539     -1
15     23       17903       22372   Coal      0     -1  6.6748478e+08       47326.54  7.0902801e-05      7051.9076      68.849     -1
15     24       22372       27956   Coal      0     -1  6.4691201e+08       46774.72  7.2304609e-05      6915.1885      74.323     -1
15     25       27956       34934   Coal      0     -1  5.8593471e+08       42400.23  7.2363404e-05      6909.5699      85.549     -1
15     26       34934       43654   Coal      0     -1  4.8670448e+08       34939.75  7.1788428e-05      6964.9108      87.582     -1
15     27       43654       54551   Coal      0     -1  3.6252162e+08       25596.63  7.0607181e-05      7081.4327      93.198     -1
15     28       54551       68167   Coal      0     -1  2.3434808e+08       16196.81  6.9114327e-05        7234.39      76.176     -1
15     29       68167       85183   Coal      0     -1  1.2699601e+08        8639.48  6.8029541e-05      7349.7482      72.283     -1
15     30       85183  1.0645e+05   Coal      0     -1       54744833        3688.46  6.7375491e-05      7421.0962      66.976     -1
15     31  1.0645e+05  1.3302e+05   Coal      0     -1       17948123        1184.46  6.5993531e-05      7576.5002      65.159     -1
15     32  1.3302e+05       1e+99   Coal      0     -1      4926174.5         316.93  6.4335926e-05      7771.7076      72.307     -1
15     -1           0       1e+99  Delay     -1     -1  3.0934923e+09              0              0              0           1     -1
15     -1           0       1e+99 Recomb     -1     -1           3300  9.9000003e-06  3.0000001e-09              0           1     -1
15     -1           0       1e+99 Resamp     -1     -1  3.0934923e+09         882448  0.00028525948              0           1     -1

Of course your output will not be the same, but if you have properly set up smcsmc to use a compute cluster and given it sufficient time to execute, then the resulting trends will be highly similar.

Todo

Combine the functions to produce a plot with a guide so that I can show it here. Currently I have a SCRM plot with guide and an msprime plot from stdpopsim but not both.

Five replications of this leads to the following results:

_images/popsim1.png

API Documentation

Main Function

smcsmc.run_smcsmc(args)[source]

The main entry point to smcsmc, this function runs the whole analysis portion from start to finish. The one single argument is a dictionary of arguments as detailed on the Input Arguments.

Tip

It’s a really good idea to run smcsmc in a tmux session on the login node of your cluster if you are doing large analyses. The main loop takes very little memory, and spawns off cluster jobs if it is configured to do so (Clustering Computing).

Parameters:args (dict) – A dictionary of arguments as per Input Arguments.

An entirely equivalent entry point is the command line interface called smc2. See Command Line Interface for examples.

smcsmc requires segment files as input. Below we detail three main ways to create them.

  • From VCF files
  • From msprime sufficient tree sequences
  • From SCRM simulations

If you are using a different data type and would like help converting it to seg format, please let us know. For more details on converting files and interpreting the output of the algorithm, please see Getting Started.

Format Conversions

smcsmc.ts_to_seg(path, n=None)[source]

Converts a tree sequence into a seg file for use by smcsmc.run_smcsmcs(). This is especially useful if you are simulating data from msprime and would like to directly use it in smcsmc. For details of how to do this, please see the tutorial on simulation using msprime.

Provide the path to the tree sequence, and the suffix will be replaced by .seg. This code is adapted from PopSim.

Parameters:
  • path (str) – Full file path to the tree sequence created by ts.dump.
  • n (list) – If more than one sample of haplotypes is being analysed simulateously, provide it here as a list. Otherwise, simply provide the number of haplotypes as a single-element list.
smcsmc.vcf_to_seg(input, output, masks=None, tmpdir='.tmp', key='tmp', chroms=range(1, 23))[source]

This function converts data from a VCF to the segment type required for smcsmc. You can also (optionally) include masks for the VCF (for example, from 1000 genomes) to indicate callable regions. This function first creates a number of intermediary VCF files in tmpdir, identifiable by their key and sample IDs. The function, by default, will loop over all chromosomes and create seperate seg.gz files for each of them, however you can specify only a subset of chromosomes by providing a list to chroms.

It is preferable for VCFs to be phased, but it is by no means necessary. Phasing helps to improve the effectiveness of the lookahead likelihood.

Warning

This function does not take a lot of memory, but it can run for quite a while with genome-scale VCFs.

The format of the input argument is as follows:

input = [
    ("path_to_vcf", "sample_ID_1"),
    ("path_to_vcf", "sample_ID_2)
]

For each individual that you would like to include in the segment file, you must specify its identifier and the path to its VCF. This means, in some cases, that you will be repeating the VCF paths. That’s okay. Give the pairing of the VCF path and Sample ID (in that order) as a tuple, and give the list of tuples as the input argument. This is the same order in which individual’s genotypes will appear in the seg file.

Chromosomes

If you have one VCF with many chromsoomes, specify the chromosomes that you want to use in the anlaysis in the chrom option, as mentioned above. If you have many VCFs for each chromosome, you will have to run this function separately for each, specifying the correct chromsome each time.

Masks

Masks are bed files which indicate the callable regions from a VCF file, and may be included. If you do include masks, please make sure to include at least as many masks as individuals.

Example

If you wanted to convert two individuals from two different VCFs, whilst specifying masks, for chromosomes 5,6, and 7:

smcsmc.vcf_to_seg(
    input = [
        ("/path/to/vcf1", "name_of_individual_1"),
        ("/path/to/vcf2", "name_of_individual_2")],
    output = "chrs567.seg.gz",
    masks = [
        "/path/to/mask1.bed.gz",
        "/path/to/mask2.bed.gz"],
    key = "chrs567",
    chroms = [5,6,7]
)

Which would create chr567.seg.gz in the current working directory.

Parameters:
  • input (list) – List of tuples, where the first element is the path to the VCF file and the second is the individual to be included.
  • output (str) – Path to the output segment file. Gzipped if the suffix indicates so.
  • masks (list) – List of masks for each of the individuals given in input. These are positive masks. Masks are given as bed files, optionally gzipped.
  • tempdir (str) – Directory to write intermediary vcf files. This is not cleaned after runs, so make sure you know where it is! This is to preserve the files for any further conversions.
  • key (str) – Unique identifier of this run.
  • chroms (list) – Either a list or range of chromosomes codes to use in this particular run.

Todo

It would be good to have a “cleanup = True” option to get rid of the intermediary files. This would be highly inefficient for rerunning but maybe worth it for some people.

Returns:Nothing.

Plotting

smcsmc.plot.plot_migration(input='result.out', output='result.png', g=30, ymax=0.00025)[source]

Plot the migration between two groups.

The function pulls from the input, which must be some sort of aggregated output from smcsmc, and plots the migration rate over time for a given generational time g, which scales the number of generations to years. The Y axis (which can be modified with the ymax parameter) represents \(m_{ij}\), or the proportion of population j being placed by migration from population i backwards in time. This is important, as most intution about migration is understood forward in time.

This is a barebones function and there is no option to provide a “truth” bar. See the other plotting functions for various definitions of “truth”.

As an example, here is the resulting (symmetric) migration from a simulation:

_images/plot_migration.png

Todo

Typo above, it is not log years, rather years and the axis is logged.

Parameters:
  • input (str) – This must be the file path to some sort of aggregated output from smcsmc. This means that it can be either a chunkfinal.out or result.out but not an individual chunk. We need all the data here.
  • output (str) – Filepath to save the plot.
  • g (int) – The length of one generation in years. This is used to scale the x axis.
  • ymax (float) – The maximum y value to plot. This is used to scale the plots up or down.
smcsmc.plot.plot_rainbow(input, output, g=30, model=None, steps=None, pop_id=1)[source]

Creates a plot of all iterations by colour. This plot is useful for assessing convergence.

Parameters:
  • input (str) – The full file path to an aggregated output file from smcsmc.
  • output (str) – Filepath to save the plot.
  • g (int) – The length of one generation in years. This is used to scale the x axis.
  • model (stdpopsim.Model) – Model for plotting.
  • ymax (float) – The maximum y value to plot. This is used to scale the plots up or down.
  • steps (int) – Don’t worry about this.
  • pop_id (int) – If your model includes multiple populations, which one do you want to plot?
_images/rainbow.png
smcsmc.plot.plot_with_guide(input, guide, output, g=30, ymax=0.00025, N0=14312)[source]

This function is very similar to smcsmc.plot.plot_migration() except that it includes the ability to add a bar of “truth”. In this case, the function uses a specific form of “truth” generated from recording all epochs output by SCRM. Additionally, we provide both the effective population size and the migration rates.

The structure of the truth guide is like so:

Start Time Pop_1 Ne Pop_2 Ne Pop_1 M Pop_2 M
0 3 3 6 2
100 0.65 0.3 14 0
. . . .
10000 0.4 0.4 8 6

Saved as a CSV file for the guide argument.

Todo

This function is part of a WIP tutorial on simulating with SCRM. More details and convenience functions to come.

Parameters:
  • input (str) – The full file path to an aggregated output file from smcsmc.
  • guide (str) – The full file path to a CSV formated as above with the truth of a simulation.
  • output (str) – Filepath to save the plot.
  • g (int) – The length of one generation in years. This is used to scale the x axis.
  • ymax (float) – The maximum y value to plot. This is used to scale the plots up or down.

Command Line Interface

An entirely equivalent command line interface (CLI), called smc2, may be used in place of the python API. It is installed into the conda bin (and alternatively given in the root of the repository).

For a full listing of arguments, type smc2 -h into a terminal, or alternatively see Input Arguments.

Glossary

Demographic Parameters
Effective population size and migration rates discretised over a given number of epochs. Within each of these epochs, the rates are assumed to be constant.
Rainbow Plot
A plot used to assess the convergence of EM. The demographic parameters inferred from each epoch are plotted in a seperate colour to visually determine if significant improvement could be made by letting the model run more iterations.
Particle
The basic unit of a particle filter, in our case a particle represents an ancestral recombination graph simulated by SCRM. Particles are simulated, and updated along the sequence by genetic information that they encounter. Particles are resampled once the effective sample size of the population reaches a given threshold and are weighted according to their approximate likelihood. Asymtotically, this form of resampling importance sampling approaches the true posterior.
Particle Filter
Particle filters, or Sequential Monte Carlo (SMC) methods, are a class of algorithms which use sampling importance resampling (SIR) to generate samples from the posterior distribution of a latent variable. In our case, this is the set of trees along the sequence. The posterior is approximated using weighted random samples denoted as “particles” drawn from a known, tractable distribution. See Tulsyan, Gopaluni, and Khare 2016 for a very readable review of the basic principles behind particle filters for inference.
Sequentially Markovian Coalescent
This may refer to one of a number of variations on McVean and Cardin 2005. The authors derive an approximation of the coalescent with recombination in which lineages with no overlapping ancestral material may not coalesce. This approximation leads to a Markovian state space of genealogies along the sequence, and allows for tractable inference.
Variational Bayes
A method used to infer the posterior distribution of demographic parameters from a set of particles. Variation Bayes is quite similar to stochastic Expectation Maximation, except that it generates a full posterior distribution of parameters rather than a single most probable value.

Changelog

  • 1.0.1: WIP
    • Introduced a function to convert from VCF to seg file formats and two tests to ensure that it is working.
    • Significantly improved the API documentation.
  • 1.0: June 18, 2019
    • Initial release!