LCODE 3D documentation¶
If you are interested in not just what LCODE does, but also how it does it, and you are reading this documentation for the first time, the recommended procedure is to glance over the first chapter and then dive straight into the source code, switching back and forth using the [source] and [docs] cross-links. Trust us, at 500 lines of code it’s even shorter than the docs, and if you get stuck, you always have a link back to the explanations.
Problem¶
Objective¶
LCODE 3D calculates the plasma response to an ultrarelativistic charged beam.
Simulating the beam itself with particles is definitely planned for the future.
Geometry¶
Quasistatic approximation is employed, with time-space coordinate \(\xi = z - ct\).
From the perspective of the beam, \(\xi\) is a space coordinate. The head of the beam corresponds to \(\xi = 0\), with its tail extending into lower, negative values of \(\xi\).
From the perspective of a plasma layer, penetrated by the beam, \(\xi\) is a time coordinate. At \(\xi = 0\) the plasma layer is unperturbed; as the beam passes through it, \(\xi\) values decrease.
The remaining two coordinates \(x, y\) are way more boring [Simulation window and grids].
The problem geometry is thus \(\xi, x, y\).
Beam¶
The beam is currently simulated as a charge density function \(\rho_b(\xi, x, y)\), and not with particles [Beam].
Plasma¶
Only the electron motion is simulated; the ions are represented with a static background charge density [Background ions].
The plasma is simulated using a PIC method with an optional twist: only a ‘coarse’ grid of plasma (think 1 particle per 9 cells) is stored and evolved, while ‘fine’ particles (think 4 per cell) are bilinearly interpolated from it during the deposition [Coarse and fine plasma]. The plasma is effectively made not from independent particles, but from a fabric of ‘fine’ TSC-2D shaped particles.
Fields¶
Both the plasma movement and the ‘external’ beam contribute to the charge density/currents \(\rho, j_x, j_y, j_z\) [Deposition].
The fields are calculated from their derivatives. Theoretically, the equations are
where indices \(e, i, b\) represent electrons, ions and beam respectively.
Note
In reality, things are not that simple.
The \(E_z\) and \(B_z\) calculations are relatively straightforward and boil down to solving the Laplace equation with Dirichlet and Neumann boundary conditions respectively.
The transverse fields are actually obtained by solving the Helmholtz equation with mixed boundary conditions, and then doing some more magic on top of that (so refer to Ez, Ex, Ey, Bx, By and Bz for the equations that we really solve).
Step¶
The \(\xi\)-cycle idea consists of looping these three actions:
- depositing plasma particles (and adding the beam density/current),
- calculating the new fields and
- moving plasma particles,
executed several times for each step in a predictor-corrector scheme [Looping in xi].
Trickery index¶
Geometry¶
- Quasistatic approximation reduces the problem dimensionality.
Todo
DOCS: write a separate page on the topic or link somewhere from the overview.
Numerical stability¶
- Helmholtz equation increases numerical stability; optional, but highly recommended.
- ‘Variant A’ increases numerical stability; optional.
- Coarse/fine plasma approach increases numerical stability; optional.
- Offset-coordinate separation (probably) helps with float precision.
Simplifications¶
- Reflection boundary is closer than the field calculation boundary to simplify boundary handling.
Simulation window and grids¶
Fields and densities grid¶

The grid on which the fields and densities are calculated.
Fields and densities (\(ro\), \(j\)) are calculated on a config.grid_steps x config.grid_steps-sized grid. This number must be odd in order to have an on-axis cell for on-axis diagnostics.
- config_example.grid_steps = 641¶ Transverse grid size in cells
- config_example.grid_step_size = 0.025¶ Transverse grid step size in plasma units
The fields are calculated at the centers of the grid cells.
Note
The muddy concept of ‘window width’ is no longer referenced in LCODE 3D to ease up the inevitable confusion about what it actually means and how it relates to config.grid_step_size. Please refrain from thinking in these terms and head over to the following subsection for more useful ones.
Reflect and ‘plasma’ boundaries¶

The reflect and plasma boundaries illustrated.
- config_example.reflect_padding_steps = 5¶ Plasma reflection <-> field calculation boundaries
- config_example.plasma_padding_steps = 10¶ Plasma placement <-> field calculation boundaries
The plasma particles are not allowed to enter the outermost cells in order to simplify the treatment of boundary regions during interpolation, deposition and field calculation [Zero special boundary treatment]. In order to achieve that, the reflection boundary is placed config.reflect_padding_steps steps deeper into the simulation area.
Note
While choosing the value of this parameter, one should take the particle size into account. Even a single fine particle [Coarse/fine plasma] is three cells wide during deposition [Plasma], so the gap should be wide enough to cover the entire coarse particle cloud. Failure to meet this condition may result in a memory violation error. This could be solved by introducing fine particle reflection, but that would be more resource-intensive.
Note that while the reflection boundary defines the area where the plasma is allowed to be present, this area must be larger than the area where the plasma is initially placed. The size of the latter is controlled by config.plasma_padding_steps, which puts a cut-off limit on the initial placement of both coarse and fine plasma particles.
Coarse and fine plasma grids¶
Finally, the area inside the plasma boundary hosts two grids of particles, the coarse grid and the fine grid [more info in Coarse/fine plasma approach], which are coarser and finer than the field grid respectively.
Units¶
Todo
DOCS (Lotov)
Installation¶
Common¶
LCODE 3D requires an NVIDIA GPU with CUDA support. CUDA Compute Capability 6+ is strongly recommended for accelerated atomic operations support.
On the Python front, it needs Python 3.6+ and the packages listed in requirements.txt:
cupy>=5.1
matplotlib>=1.4
numba>=0.41
numpy>=1.8
scipy>=0.14
Most of them are extremely popular, and the only one that may be slightly problematic to obtain due to its ‘teen age’ is cupy.
Linux, distribution Python¶
All the dependencies, except for, probably, cupy, should be easily installable with your package manager.
Install NVIDIA drivers and CUDA packages according to your distribution documentation.
Install cupy according to the official installation guide, unless 5.1 or newer is already packaged by your distribution.
Linux, Anaconda¶
All dependencies, including cupy, are available from the official conda channels.
conda install cupy
or, if you are a miniconda user or a thorough kind of person,
while read req; do conda install --yes $req; done < requirements.txt
You probably still need to install NVIDIA drivers and CUDA packages, follow your distribution documentation.
Linux, NixOS¶
nix-shell
In case it’s not enough, consider either switching to NVIDIA driver, or simply adding
boot.kernelPackages = pkgs.linuxPackages; # explicitly require stable kernel
boot.kernelModules = [ "nvidia-uvm" ]; # should bring just enough kernel support for CUDA userspace
to /etc/nixos/configuration.nix
and rebuilding the system.
Linux, locked-down environment¶
If you want to, e.g., run LCODE 3D on a cluster without permissions to install software the proper way, please contact the administrator first and refer them to this page.
If you are sure about CUDA support and you absolutely want to install the dependencies yourself, then make sure you have Python 3.6+ and try to install cupy using the official installation guide. If you succeed, install all the other missing requirements with pip’s ‘User Installs’ feature.
Your mileage may vary. You’re responsible for the cleanup.
Windows, Anaconda¶
If cupy 5.1 or newer has already hit the channels, you’re in luck. Just conda install cupy, and you should be good to go.
If https://anaconda.org/anaconda/cupy still shows ‘win-64’ at v4.1.0, please accept our condolences and proceed to the next subsection.
Windows, the hard way¶
- Ensure that you have Python 3.6+.
- Free up some 10 GiB of disk space or more.
- Verify that you’re on good terms with the deity of your choice.
- Install Visual Studio (Community Edition is fine) with C++ support.
- Install NVIDIA CUDA Toolkit.
- Follow the cupy installation guide. Prefer installing precompiled packages, but you might also try installing from source.
- Verify that it works by executing import cupy; (cupy.asarray([2])**2).get() in a Python shell.
- Install the other dependencies.
Known to work¶
As of early 2019, LCODE 3D is developed and known to work under:
- NixOS 19.03 “Koi”
- Debian 10 “Buster” + Anaconda 2019.03
- Windows 10 1903 + Anaconda 2019.03
Running and embedding¶
Only two files are required¶
LCODE 3D is a single-file module and you only need two files to execute it: lcode.py and config.py.
Installing LCODE into PYTHONPATH with the likes of pip install . is possible, but is not officially supported.
Configuration¶
LCODE 3D is configured by placing a file config.py into the current working directory. An example is provided as config_example.py.
The file gets imported by the standard Python importing mechanism, and the resulting module is passed around internally as config.
One can use all the features of Python inside the configuration file, from arithmetic expressions and functions to other modules and metaprogramming.
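Since the configuration is plain Python, a minimal config.py can look roughly like the sketch below. The parameter names are the ones documented throughout this page; the values are just the documented defaults plus a shortened xi_steps, not a validated physics setup.

# config.py, a minimal sketch; see config_example.py for the authoritative,
# fully commented version.
gpu_index = 0

grid_steps = 641                      # must be odd
grid_step_size = 0.025

reflect_padding_steps = 5
plasma_padding_steps = 10
plasma_coarseness = 3
plasma_fineness = 2

field_solver_subtraction_trick = 1
field_solver_variant_A = True

xi_step_size = 0.005
xi_steps = 1500                       # shortened for a quick test run

# plus a beam(xi_i, x, y) function, see the Beam section below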
Ez¶
Equations¶
We want to solve
with Dirichlet (zero) boundary conditions.
Method¶
The algorithm can be succinctly written as iDST2D(dirichlet_matrix * DST2D(RHS)), where DST2D and iDST2D are the Type-1 forward and inverse 2D Discrete Sine Transforms respectively, RHS is the right-hand side of the equation above, and dirichlet_matrix is a ‘magical’ matrix that does all the work.
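To make the pattern concrete, here is a hedged CPU sketch of the same transform-multiply-inverse-transform pipeline built on scipy. The eigenvalues below are the textbook ones for the standard 5-point Laplacian and merely stand in for the values that dirichlet_matrix() actually tabulates (which also absorb the DST normalization); the real solver additionally strips perimeter cells and runs on the GPU.

import numpy as np
from scipy.fft import dstn, idstn   # CPU stand-ins for the cupy-based dst2d()

def solve_dirichlet_laplace(rhs, h):
    # iDST2D(eigenvalue_matrix * DST2D(rhs)) for the 5-point Laplacian
    # with zero Dirichlet boundary conditions on an N x N grid of step h.
    N = rhs.shape[0]
    k = np.arange(1, N + 1)
    s = np.sin(np.pi * k / (2 * (N + 1)))**2
    lam = -4 / h**2 * (s[:, None] + s[None, :])   # Laplacian eigenvalues
    return idstn(dstn(rhs, type=1) / lam, type=1)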
- lcode.dirichlet_matrix(grid_steps, grid_step_size)[source]¶ Calculate a magical matrix that solves the Laplace equation if you elementwise-multiply the RHS by it “in DST-space”. See Samarskiy-Nikolaev, p. 187. In addition to the magic values, it also hosts the DST normalization multiplier.
Todo
DOCS: expand with method description (Kargapolov, Shalimova)
DST2D¶
- lcode.dst2d(a)[source]¶ Calculate DST-Type1-2D, jury-rigged from anti-symmetrically-padded rFFT.
As cupy currently ships no readily available function for calculating the DST2D on the GPU, we roll our own FFT-based implementation. We don’t need to make a separate iDST2D function, as (for Type-1) it matches DST2D up to the normalization multiplier, which is taken into account in dirichlet_matrix().
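For the curious, the padding trick looks roughly like this in 1D (plain NumPy; the actual dst2d() does the same in 2D with cupy and leaves the normalization to dirichlet_matrix()):

import numpy as np
from scipy.fft import dst   # used only to cross-check the result

def dst1d_via_rfft(a):
    # DST-Type1 from an anti-symmetrically padded real FFT
    N = a.size
    padded = np.zeros(2 * N + 2)
    padded[1:N + 1] = a            # original data
    padded[N + 2:] = -a[::-1]      # anti-symmetric extension
    return -np.fft.rfft(padded).imag[1:N + 1]

a = np.random.rand(7)
assert np.allclose(dst1d_via_rfft(a), dst(a, type=1))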
Ex, Ey, Bx, By¶
Theory¶
We want to solve
with mixed boundary conditions.
Todo
DOCS: specify the boundary conditions.
Unfortunately, what we actually solve is no less than three steps away from these.
Helmholtz equations¶
The harsh reality of numerical stability forces us to solve these Helmholtz equations instead:
Note
The behaviour is actually configurable with config.field_solver_subtraction_trick (what a mouthful). 0 or False corresponds to the Laplace equation, and while any floating-point values or whole matrices of them should be accepted, it’s recommended to simply use 1 or True instead.
- config_example.field_solver_subtraction_trick = 1¶ 0 for Laplace eqn., Helmholtz otherwise
Method¶
The algorithm can be succinctly written as iMIX2D(mixed_matrix * MIX2D(RHS)), where MIX2D and iMIX2D are the Type-1 forward and inverse 2D discrete transforms, Sine in one direction and Cosine in the other, RHS is the right-hand side of the equation above, and mixed_matrix is a ‘magical’ matrix that does all the work.
- lcode.mixed_matrix(grid_steps, grid_step_size, subtraction_trick)[source]¶ Calculate a magical matrix that solves the Helmholtz or Laplace equation (subtraction_trick=True and subtraction_trick=False correspondingly) if you elementwise-multiply the RHS by it “in DST-DCT-transformed-space”. See Samarskiy-Nikolaev, p. 189 and around.
Todo
DOCS: expand with method description (Kargapolov, Shalimova)
- lcode.calculate_Ex_Ey_Bx_By(config, Ex_avg, Ey_avg, Bx_avg, By_avg, beam_ro, ro, jx, jy, jz, jx_prev, jy_prev)[source]¶ Calculate transverse fields as iDST-DCT(mixed_matrix * DST-DCT(RHS.T)).T, with and without transposition depending on the field component.
Note that some outer cells do not participate in the calculations, and the result is simply padded with zeroes in the end. We don’t define separate functions for separate boundary condition types and simply transpose the input and output data.
DST-DCT Hybrid¶
- lcode.mix2d(a)[source]¶ Calculate a DST-DCT-hybrid transform (DST in the first direction, DCT in the second one), jury-rigged from padded rFFT (anti-symmetric in the first direction, symmetric in the second direction).
As cupy currently ships no readily available function for calculating even 1D DST/DCT on the GPU, we, once again, roll our own FFT-based implementation. We don’t need a separate function for the inverse transform, as it matches the forward one up to the normalization multiplier, which is taken into account in mixed_matrix().
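A rough CPU analogue of the hybrid transform, assuming Type-1 transforms along each axis (padding and normalization details of the real GPU implementation differ):

from scipy.fft import dst, dct   # CPU stand-ins for the cupy rFFT trick

def mix2d_reference(a):
    # DST (Type-1) along the first axis, DCT (Type-1) along the second one
    return dct(dst(a, type=1, axis=0), type=1, axis=1)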
Variant B¶
But wait, the complications don’t stop here.
While we do have a successfully implemented \((\Delta_\perp - 1)\) inverse operator, there’s still an open question of supplying an unknown value to the RHS.
The naive version (internally known as ‘Variant B’) is to pass the best known substitute to date, i.e. previous layer fields at the predictor phase and the averaged fields at the corrector phase. \(\xi\)-derivatives are taken at half-steps, transverse derivatives are averaged at half-steps or taken from the previous layer if not available.
Variant A¶
The more correct version (known as ‘Variant A’) mutates the equations once again to take everything at half-steps:
and calculates the fields at the next step in the following fashion: \(E_x^{next} = 2 E_x^{avg} - E_x^{prev}\), etc.
Solving these is equivalent to solving Variant B equations
with averaged fields, \(\rho\) and \(j_z\) and applying the above transformation to the result.
See lcode.step()
for the wrapping code that does that.
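In code, the final extrapolation is a one-liner per component (a sketch using the naming from the equation above):

def variant_a_next_field(avg, prev):
    # E_x^{next} = 2 E_x^{avg} - E_x^{prev}; same for Ey, Bx, By
    return 2 * avg - prev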
- config_example.field_solver_variant_A = True¶ Use Variant A or Variant B for Ex, Ey, Bx, By
Bz¶
Equations¶
We want to solve
with Neumann boundary conditions (derivative = 0).
Method¶
The algorithm can be succinctly written as iDCT2D(neumann_matrix * DCT2D(RHS)), where DCT2D and iDCT2D are the Type-1 forward and inverse 2D Discrete Cosine Transforms respectively, RHS is the right-hand side of the equation above, and neumann_matrix is a ‘magical’ matrix that does all the work.
- lcode.neumann_matrix(grid_steps, grid_step_size)[source]¶ Calculate a magical matrix that solves the Laplace equation if you elementwise-multiply the RHS by it “in DCT-space”. See Samarskiy-Nikolaev, p. 187. In addition to the magic values, it also hosts the DCT normalization multiplier.
Todo
DOCS: expand with method description (Kargapolov, Shalimova)
DCT2D¶
- lcode.dct2d(a)[source]¶ Calculate DCT-Type1-2D, jury-rigged from symmetrically-padded rFFT.
As cupy currently ships no readily available function for calculating the DCT2D on the GPU, we roll our own FFT-based implementation. We don’t need to make a separate iDCT2D function, as (for Type-1) it matches DCT2D up to the normalization multiplier, which is taken into account in neumann_matrix().
Plasma¶
Characteristics¶
A plasma particle has these characteristics according to our model:
- Coordinates \(x\) and \(y\), stored as x_init + x_offt and y_init + y_offt [Offset-coordinate separation].
- Momenta \(p_x\), \(p_y\) and \(p_z\), stored as px, py and pz.
- Charge \(q\), stored as q.
- Mass \(m\), stored as m.
Shape¶
From the interpolation/deposition perspective, a plasma particle represents not a point in space, but a 2D Triangular-Shaped Cloud (TSC2D).
These clouds always (partially) cover an area the size of \(3 \times 3\) cells: the one where their center lies and the eight neighbouring ones.

Todo
DOCS: WRITE: write a nicer formula for the weights of each cell.
- lcode.weights(x, y, grid_steps, grid_step_size)[source]¶ Calculate the indices of a cell corresponding to the coordinates, and the coefficients of interpolation and deposition for this cell and the 8 surrounding cells. The weights correspond to a 2D triangular shaped cloud (TSC2D).
The same coefficients are used both for deposition of the particle characteristics onto the grid [Deposition] and for interpolation of the fields at the particle center positions [Plasma pusher].
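The per-axis coefficients are the standard TSC (quadratic spline) weights; a NumPy sketch of how the 3x3 block can be built from a particle position is given below. The indexing convention here is illustrative and may differ from the one in weights().

import numpy as np

def tsc2d_weights(x, y, grid_step_size):
    # nearest cell index and the offset from its center, in grid step units
    i, j = int(round(x / grid_step_size)), int(round(y / grid_step_size))
    dx, dy = x / grid_step_size - i, y / grid_step_size - j
    # standard 1D TSC weights for the previous/central/next cells
    wx = np.array([0.5 * (0.5 - dx)**2, 0.75 - dx**2, 0.5 * (0.5 + dx)**2])
    wy = np.array([0.5 * (0.5 - dy)**2, 0.75 - dy**2, 0.5 * (0.5 + dy)**2])
    return i, j, np.outer(wx, wy)   # 3x3 block of weights, sums to 1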
- lcode.deposit9(a, i, j, val, wMP, w0P, wPP, wM0, w00, wP0, wMM, w0M, wPM)[source]¶ Deposit value into a cell and 8 surrounding cells (using weights output).
- lcode.interp9(a, i, j, wMP, w0P, wPP, wM0, w00, wP0, wMM, w0M, wPM)[source]¶ Collect value from a cell and 8 surrounding cells (using weights output).
The concept is orthogonal to the coarse plasma particle shape [Coarse and fine plasma]. While a coarse particle may be considered to be a component of an elastic cloud of fine particles, each individual fine particle sports the same TSC2D shape.
Coarse and fine plasma¶
Concept¶
In order to increase stability and combat transverse grid noise, LCODE 3D utilises a dual plasma approach.

Positioning of the coarse and fine particles in dual plasma approach.
Coarse particles are the ones that get tracked throughout the program and pushed by the pusher. The coarse plasma grid is many times sparser than the fields grid, think \(\frac{1}{9}\) particles per cell.
- config_example.plasma_coarseness = 3¶ Square root of the amount of cells per coarse particle
Fine particles only exist inside the deposition phase. There are several fine particles per cell, think \(4\) or more. Their characteristic values are neither stored nor evolved; instead, they are interpolated from the coarse particle grid as part of the deposition process (and they don’t ‘exist’ in any form outside of it).
- config_example.plasma_fineness = 2¶ Square root of the amount of fine particles per cell
Initialization¶
- lcode.make_coarse_plasma_grid(steps, step_size, coarseness=3)[source]¶ Create initial coarse plasma particle coordinates (a single 1D grid for both x and y).
- lcode.make_fine_plasma_grid(steps, step_size, fineness=2)[source]¶ Create initial fine plasma particle coordinates (a single 1D grid for both x and y).
Avoids positioning particles at the cell edges and boundaries.
fineness=3 (and coarseness=2):
+-----------+-----------+-----------+-----------+
| .   .   . | .   .   . | .   .   . | .   .   . |
|           |           |           |           |      .  -  fine particle
| .   .   . | .   *   . | .   .   . | .   *   . |
|           |           |           |           |      *  -  coarse+fine particle
| .   .   . | .   .   . | .   .   . | .   .   . |
+-----------+-----------+-----------+-----------+
fineness=2 (and coarseness=2):
+-------+-------+-------+-------+-------+
| .   . | .   . | .   . | .   . | .   . |      .  -  fine particle
|       |   *   |       |   *   |       |
| .   . | .   . | .   . | .   . | .   . |      *  -  coarse particle
+-------+-------+-------+-------+-------+
- lcode.make_plasma(steps, cell_size, coarseness=3, fineness=2)[source]¶ Make coarse plasma initial state arrays and the arrays needed to interpolate coarse plasma into fine plasma (virt_params).
Coarse is the one that will evolve and fine is the one to be bilinearly interpolated from the coarse one based on the initial positions (using the 1 to 4 coarse plasma particles that initially were the closest).
Initializing coarse particles is pretty simple: coarse_x_init and coarse_y_init are the broadcasted output of make_coarse_plasma_grid(). coarse_x_offt and coarse_y_offt are zeros, and so are coarse_px, coarse_py and coarse_pz. coarse_m and coarse_q are constants divided by the factor of coarseness by fineness squared, because fine particles represent smaller macroparticles.
Initializing fine particles boils down to calculating the interpolation coefficients (influence_prev and influence_next) and the indices of the coarse particles (indices_prev, indices_next) that the characteristics will be interpolated from.
influence_prev and influence_next are linear interpolation coefficients based on the initial closest coarse particle positioning. Note that these are constant and do not change in \(\xi\). The edges get special treatment later.
indices_next happens to be pretty much equal to np.searchsorted(coarse_grid, fine_grid) and indices_prev is basically indices_next - 1, except for the edges, where a fine particle can have fewer than four ‘parent’ coarse particles. For such ‘outer’ particles, existing coarse particles are used instead, so clipping the indices and fixing the influence-arrays is carried out.
Note that these arrays are 1D for memory considerations [Memory considerations].
The function returns the coarse particles and virt_params: a GPUArrays instance that conveniently groups the fine-particle related arrays, which only matter during deposition, under a single name.
Coarse-to-fine interpolation¶

- lcode.mix(coarse, A, B, C, D, pi, ni, pj, nj)[source]¶ Bilinearly interpolate fine plasma properties from four historically-neighbouring plasma particle property values:
B D    #  y ^        A - bottom-left neighbour,  indices: pi, pj
 .     #    |        B - top-left neighbour,     indices: pi, nj
       #    +---->   C - bottom-right neighbour, indices: ni, pj
A C    #      x      D - top-right neighbour,    indices: ni, nj
See the rest of the deposition and plasma creation for more info.
This is just a shorthand for the characteristic value mixing for internal use in coarse_to_fine.
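In plain NumPy the mixing amounts to the following (the names follow the diagram above; this is a sketch, not the [source] listing):

def mix_reference(coarse, A, B, C, D, pi, ni, pj, nj):
    # bilinear blend of the four historically-neighbouring coarse values
    return (A * coarse[pi, pj] + B * coarse[pi, nj] +
            C * coarse[ni, pj] + D * coarse[ni, nj])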

- lcode.coarse_to_fine(fi, fj, c_x_offt, c_y_offt, c_m, c_q, c_px, c_py, c_pz, virtplasma_smallness_factor, fine_grid, influence_prev, influence_next, indices_prev, indices_next)[source]¶ Bilinearly interpolate fine plasma properties from four historically-neighbouring plasma particle property values.
The internals are pretty straightforward once you wrap your head around the indexing.
A single fine particle with the indices [fi, fj] (in the fine-particle virt_params 1D arrays) is interpolated from four particles with indices [pi, pj], [pi, nj], [ni, pj], [ni, nj] (in the coarse-particle c_* arrays) and four weights A, B, C, D respectively. The weights are, in turn, composed as products of values from the influence_prev and influence_next arrays, indexed, once again, with [fi, fj]. It would be convenient to calculate them beforehand, but they are recalculated instead as a result of a time-memory tradeoff [Memory considerations].
Finally, momenta, charge and mass are scaled according to the coarse-to-fine macrosity coefficient discussed above.
Deposition¶
Deposition operates on fine particles. Once the coarse-to-fine interpolation is out of the picture, there isn’t much left to discuss.
- lcode.deposit_kernel(grid_steps, grid_step_size, virtplasma_smallness_factor, c_x_offt, c_y_offt, c_m, c_q, c_px, c_py, c_pz, fine_grid, influence_prev, influence_next, indices_prev, indices_next, out_ro, out_jx, out_jy, out_jz)[source]¶ Interpolate coarse plasma into fine plasma and deposit it on the charge density and current grids.
First, the fine particle characteristics are interpolated from the coarse ones. Then the total contribution of the particles to the density and the currents is calculated and deposited on a grid in a 3x3 cell square with i, j as its center, according to the weights calculated by weights(). Finally, the ion background density is added to the resulting array.
The strange incantation at the top and the need to modify the output arrays instead of returning them are dictated by the fact that this is actually not a function, but a CUDA kernel (for more info, refer to CUDA kernels with numba.cuda). It is launched in parallel for each fine particle, determines its 2D index (fi, fj), interpolates its characteristics from coarse particles and proceeds to deposit it.
- lcode.deposit(config, ro_initial, x_offt, y_offt, m, q, px, py, pz, virt_params)[source]¶ Interpolate coarse plasma into fine plasma and deposit it on the charge density and current grids. This is a convenience wrapper around the deposit_kernel CUDA kernel.
This function allocates the output arrays, unpacks the arguments from config and virt_params, calculates the kernel dispatch parameters (for more info, refer to CUDA kernels with numba.cuda), and launches the kernel.
Todo
DOCS: explain deposition contribution formula (Lotov)
Background ions¶
LCODE 3D currently simulates only the movement of plasma electrons. The ions are modelled as a constant charge density distribution component that is calculated from the initial electron placement during the initialization.
- lcode.initial_deposition(config, x_offt, y_offt, px, py, pz, m, q, virt_params)[source]¶ Determine the background ion charge density by depositing the electrons with their initial parameters and negating the result.
For this initial deposition invocation, the ion density argument is specified as 0. The result is stored as const.ro_initial and passed to every consecutive lcode.deposit() invocation.
Plasma pusher¶
Without fields¶
The coordinate-evolving equations of motion are as follows:
- lcode.move_estimate_wo_fields(config, m, x_init, y_init, prev_x_offt, prev_y_offt, px, py, pz)[source]¶ Move coarse plasma particles as if there were no fields. Also reflect the particles from +-reflect_boundary.
This is used at the beginning of the xi step to roughly estimate the half-step positions of the particles.
The reflection here flips the coordinate, but not the momenta components.
With fields¶
The coordinate-evolving equation of motion is as follows:
As the particle momentum is present on both sides of the equation (as \(p\) and \(v\) respectively), an iterative predictor-corrector scheme is employed.
The alternative is to use a symplectic solver that solves the resulting matrix equation (not mainlined at the moment, look for an alternative branch in t184256’s fork).
- lcode.move_smart_kernel(xi_step_size, reflect_boundary, grid_step_size, grid_steps, ms, qs, x_init, y_init, prev_x_offt, prev_y_offt, estimated_x_offt, estimated_y_offt, prev_px, prev_py, prev_pz, Ex_avg, Ey_avg, Ez_avg, Bx_avg, By_avg, Bz_avg, new_x_offt, new_y_offt, new_px, new_py, new_pz)[source]¶ Update plasma particle coordinates and momenta according to the field values interpolated halfway between the previous plasma particle location and the best estimation of its next location currently available to us. Also reflect the particles from +-reflect_boundary.
The function serves as the coarse particle loop, fusing together midpoint calculation, field interpolation with interp9() and particle movement for performance reasons.
The equations for the half-step momentum are solved twice, with a more precise momentum the second time. The particle coordinates are advanced using the half-step momentum, and afterwards the momentum is advanced to the next step.
The reflection is more involved this time, affecting both the coordinates and the momenta. Note that the reflected particle offsets are mixed with the positions, resulting in a possible float precision loss [Offset-coordinate separation]. This effect is probably negligible, as the particle has already travelled at least several cell sizes by that point. The only place where the separation really matters is the (final) coordinate addition (x_offt += ... and y_offt += ...).
The strange incantation at the top and the need to modify the output arrays instead of returning them are dictated by the fact that this is actually not a function, but a CUDA kernel (for more info, refer to CUDA kernels with numba.cuda). It is launched in parallel for each coarse particle, determines its 1D index k, interpolates the fields at its position and proceeds to move and reflect it.
- lcode.move_smart(config, m, q, x_init, y_init, x_prev_offt, y_prev_offt, estimated_x_offt, estimated_y_offt, px_prev, py_prev, pz_prev, Ex_avg, Ey_avg, Ez_avg, Bx_avg, By_avg, Bz_avg)[source]¶ Update plasma particle coordinates and momenta according to the field values interpolated halfway between the previous plasma particle location and the best estimation of its next location currently available to us. This is a convenience wrapper around the move_smart_kernel CUDA kernel.
This function allocates the output arrays, unpacks the arguments from config, calculates the kernel dispatch parameters (for more info, refer to CUDA kernels with numba.cuda), flattens the input and output arrays of particle characteristics (as the pusher does not care about the particle 2D indices) and launches the kernel.
Beam¶
The beam is currently simulated as a charge density function \(\rho_b(\xi, x, y)\), and not with particles.
In the future, there will certainly be a way to define a beam with particles and simulate beam-plasma interaction both ways, but for now only simulating the plasma response to a rigid beam is possible.
- config_example.beam(xi_i, x, y)[source]¶ The user should specify the beam charge density as a function in the configuration file.
xi_i is not the value of the \(\xi\) coordinate, but the step index. Please use something along the lines of xi = -xi_i * xi_step_size + some_offset, depending on where exactly in \(\xi\) you define the beam density slices [Integer xi steps].
x and y are numpy arrays, so one should use vectorized numpy operations to calculate the desired beam charge density, like numpy.exp(-numpy.sqrt(x**2 + y**2)).
The function should ultimately return an array with the same shape as x and y.
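For instance, a rigid Gaussian driver could be specified in config.py like this (the amplitude, length and \(\xi\)-offset below are made-up illustration values, not a recommended setup):

import numpy as np

xi_step_size = 0.005    # also read by LCODE itself

def beam(xi_i, x, y):
    xi = -xi_i * xi_step_size            # step index -> xi coordinate
    r2 = x**2 + y**2
    # Gaussian both transversely and in xi, centred at xi = -3
    return 0.05 * np.exp(-r2 / 2) * np.exp(-(xi + 3)**2 / (2 * 0.5**2))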
Todo
CODE: Simulate the beam with particles and evolve it according to the plasma response.
Looping in xi¶
Finally, here’s the function that binds it all together and currently makes up half of the LCODE 3D API.
In short, it moves, deposits, estimates fields, moves, deposits, recalculates fields, moves and deposits.
- config_example.xi_step_size = 0.005¶ Step size in time-space coordinate xi
- config_example.xi_steps = 599999¶ Amount of xi steps
- lcode.step(config, const, virt_params, prev, beam_ro)[source]¶ Calculate the next iteration of plasma evolution and response. Returns the new state with the following attributes: x_offt, y_offt, px, py, pz, Ex, Ey, Ez, Bx, By, Bz, ro, jx, jy, jz. Pass the returned value as prev for the next iteration. Wrap it in GPUArraysView if you want transparent conversion to numpy arrays.
Input parameters¶
Beam density array rho_b (beam_ro) is copied to the GPU with cupy.asarray, as it is calculated with numpy in the config-residing beam().
All the other arrays come packed in GPUArrays objects [GPU array conversion], which ensures that they reside in the GPU memory. These objects are:
- const and virt_params, which are constant at least for the \(\xi\)-step duration and defined during the initialization, and
- prev, which is usually obtained as the return value of the previous step() invocation, except for the very first step.
Initial half-step estimation¶
1. The particles are advanced according to their current momenta only (lcode.move_estimate_wo_fields()).
Field prediction¶
While we don’t know the fields on the next step:
2. The particles are advanced with the fields from the previous step, using the coordinates estimated at 1. to calculate the half-step positions where the previous step fields should be interpolated at (lcode.move_smart()).
3. The particles from 2. are deposited onto the charge/current density grids (lcode.deposit()).
4. The fields at the next step are calculated using the densities from 3. (lcode.calculate_Ez(), lcode.calculate_Ex_Ey_Bx_By(), lcode.calculate_Bz()) and averaged with the previous fields.
This phase gives us an estimation of the fields at half-step, and the coordinate estimation at next step, while all other intermediate results are ultimately ignored.
Field correction¶
5. The particles are advanced with the averaged fields from 4., using the coordinates from 2. to calculate the half-step positions where the averaged fields from 4. should be interpolated at (lcode.move_smart()).
6. The particles from 5. are deposited onto the charge/current density grids (lcode.deposit()).
7. The fields at the next step are calculated using the densities from 6. (lcode.calculate_Ez(), lcode.calculate_Ex_Ey_Bx_By(), lcode.calculate_Bz()) and averaged with the previous fields.
The resulting fields are far more precise than the ones from the prediction phase, but the coordinates and momenta are still pretty low-quality until we recalculate them using the new fields. Iterating the algorithm more times improves the stability, but it currently doesn’t bring much to the table as the transverse noise dominates.
Final plasma evolution and deposition¶
8. The particles are advanced with the averaged fields from 7., using the coordinates from 5. to calculate the half-step positions where the averaged fields from 7. should be interpolated at (lcode.move_smart()).
9. The particles from 8. are deposited onto the charge/current density grids (lcode.deposit()).
The result, or the ‘new prev’¶
The fields from 7., the coordinates and momenta from 8., and the densities from 9. make up the new GPUArrays collection that will be passed as prev to the next iteration of step().
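Put together, an outer loop driving step() looks roughly like the sketch below. The exact return value of lcode.init() and the way the transverse x/y arrays reach config.beam() are assumptions here (init() prepares such arrays internally, but the sketch rebuilds them by hand); check lcode.py for the actual wiring.

import numpy as np
import lcode
import config

const, virt_params, state = lcode.init(config)   # assumed return values

# assumed host-side coordinate arrays for evaluating config.beam()
xs = (np.arange(config.grid_steps) - config.grid_steps // 2) * config.grid_step_size
x, y = np.meshgrid(xs, xs, indexing='ij')

for xi_i in range(config.xi_steps):
    beam_ro = config.beam(xi_i, x, y)            # numpy; step() moves it to the GPU
    state = lcode.step(config, const, virt_params, state, beam_ro)
    if xi_i % 100 == 0:                          # occasional host-side diagnostics
        view = lcode.GPUArraysView(state)
        print(xi_i, float(view.Ez[config.grid_steps // 2, config.grid_steps // 2]))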
Initialization¶
- lcode.init(config)[source]¶ Initialize all the arrays needed for step and config.beam.
This function performs quite a boring sequence of actions, outlined here for interlinking purposes:
- validates the oddity of config.grid_steps [Fields and densities grid],
- validates that config.reflect_padding_steps is large enough [Reflect and ‘plasma’ boundaries],
- calculates the reflect_boundary and monkey-patches it back into config,
- initializes the x and y arrays for use in config_example.beam(),
- calculates the plasma placement boundary,
- immediately passes it to make_plasma(), leaving it oblivious to the padding concerns,
- performs the initial electron deposition to obtain the background ion charge density [Background ions],
- groups the constant arrays into a GPUArrays instance const, and
- groups the evolving arrays into a GPUArrays instance state.
GPU calculation peculiarities¶
LCODE 3D performs most of the calculations on GPU using a mix of two approaches.
CUDA kernels with numba.cuda¶
One can use CUDA from Python more or less directly by writing and launching CUDA kernels with numba.cuda.
An example would be:
import numba.cuda

@numba.cuda.jit
def add_two_arrays_kernel(arr1, arr2, result):
    i = numba.cuda.grid(1)     # global 1D thread index
    if i >= arr1.shape[0]:     # extra threads in the last warp do nothing
        return
    result[i] = arr1[i] + arr2[i]
This function represents a loop body, launched in parallel with many threads at once.
Each of them starts with obtaining the array index it is ‘responsible’ for with cuda.grid(1)
and then proceeds to perform the required calculation.
As it is optimal to launch them in 32-threaded ‘warps’, one also has to handle the case
of having more threads than needed by making them skip the calculation.
No fancy Python operations are supported inside CUDA kernels, it is basically a way to write C-like bodies for hot loops without having to write actual C/CUDA code. You can only use simple types for kernel arguments and you cannot return anything from them.
To rub it in, this isn’t even a directly callable function yet. To conceal the limitations and the calling complexity, it is convenient to write a wrapper for it.
from math import ceil
import cupy as cp

def add_two_arrays(arr1, arr2):
    result = cp.zeros_like(arr1)  # uses cupy, see below
    warp_count = int(ceil(arr1.size / WARP_SIZE))
    add_two_arrays_kernel[warp_count, WARP_SIZE](arr1, arr2, result)
    return result
A pair of numbers (warp_count, WARP_SIZE) is required to launch the kernel. warp_count is chosen so that warp_count * WARP_SIZE is at least as large as the problem size.
- lcode.WARP_SIZE = 32¶ Should be detectable with newer cupy (>6.0.0b2) as WARP_SIZE = cp.cuda.Device(config.gpu_index).attributes['WarpSize']. As of 2019 it’s equal to 32 for all CUDA-capable GPUs. It’s even a hardcoded value in cupy.
Array-wise operations with cupy¶
cupy is a GPU array library that aims to implement a numpy-like interface to GPU arrays. It allows one to, e.g., add up two GPU arrays with a simple and terse a + b.
Most of the functions in LCODE use vectorized operations and cupy. All memory management is done with cupy for consistency.
It’s hard to overestimate the convenience of this approach, but sometimes expressing algorithms in vectorized notation is too hard or suboptimal. The only two times we actually go for writing CUDA kernels are deposit() (our fine particle loop) and move_smart() (our coarse particle loop).
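A tiny illustration of the array-wise style (and of the explicit host transfer discussed in the next subsections):

import cupy as cp

a = cp.linspace(0, 1, 5)     # lives in GPU memory
b = cp.ones(5)
c = cp.exp(a) + b            # computed on the GPU, stays on the GPU
c_host = c.get()             # explicit copy back to a numpy array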
Copying is expensive¶
If the arrays were copied between GPU RAM and host RAM, the PCI-E bandwidth would become a bottleneck. The two most useful strategies to minimize excessive copying are
- churning for several consecutive \(\xi\)-steps with no copying and no CPU-side data processing (with a notable exception of beam() and the resulting beam_ro); and
- copying only the subset of the arrays that the outer diagnostics code needs.
GPU array conversion¶
In order for a + b to work in cupy, both arrays have to be copied to the GPU (cupy.asarray(a)) and, in case you want the results back as numpy arrays, you have to explicitly copy them back (gpu_array.get()).
While for LCODE 3D itself it’s easier and quicker to stick to using GPU arrays exclusively, this means the only time we want to do the conversion to numpy is when we are returning the results back to the external code.
There are two classes that assist in copying the arrays back and forth as conveniently as possible. The implementation looks a bit nightmarish, but using them is simple.
- class lcode.GPUArrays(**kwargs)[source]¶ A convenient way to group several GPU arrays and access them with a dot.
x = GPUArrays(something=numpy_array, something_else=another_array) will create x with x.something and x.something_else stored on the GPU.
Do not add more attributes later, specify them all at construction time.
- class lcode.GPUArraysView(gpu_arrays)[source]¶ This is a magical wrapper around GPUArrays that handles GPU-RAM data transfer transparently. Accessing view.something will automatically copy the array to host RAM, setting view.something = ... will copy the changes back to GPU RAM.
Usage: view = GPUArraysView(gpu_arrays); view.something
Do not add more attributes later, specify them all at construction time.
NOTE: repeatedly accessing an attribute will result in repeated copying!
This way we can wrap everything we need in GPUArrays with, e.g., const = GPUArrays(x_init=x_init, y_init=y_init, ...) and then access them as const.x_init from GPU-heavy code.
For the outer code that does not care about GPU arrays at all, we can return a wrapped const_view = GPUArraysView(const) and access the arrays as const_view.x_init.
Copying will happen on-demand during the attribute access,
intercepted by our __getattr__
implementation,
but beware!
Note
Repeatedly accessing const_view.x_init will needlessly perform the copying again, so one should bind it to the variable name (x_init = const_view.x_init) once and reuse the resulting numpy array.
Todo
CODE: wrap the returned array with GPUArraysView by default
Selecting GPU¶
- config_example.gpu_index = 0¶ Index of the GPU that should perform the calculations
LCODE 3D currently does not support utilizing several GPUs for one simulation, but once we switch to beam evolution calculation, processing several consecutive \(t\)-steps in a pipeline of several GPUs should be a low hanging fruit.
Optimal transverse grid sizes¶
FFT works best for grid sizes that are factorizable into small numbers. Any size will work, but the performance may vary dramatically.
FFTW documentation quotes the optimal size for their algorithm as \(2^a 3^b 5^c 7^d 11^e 13^f\), where \(e+f\) is either \(0\) or \(1\), and the other exponents are arbitrary.
While LCODE 3D does not use FFTW (it uses cufft instead, wrapped by cupy), the formula is still quite a good rule of thumb for calculating performance-friendly config_example.grid_steps values.
The employed FFT sizes for a grid sized \(N\) are \(2N-2\) for both the DST (dst2d(), mix2d()) and DCT transforms (dct2d(), mix2d()) when we take padding and perimeter cell stripping into account. This leaves us to find such \(N\) that \(N-1\) satisfies the small-factor conditions.
If you don’t mind arbitrary grid sizes, we suggest using
- \(N=2^K + 1\), they always perform the best, or
- one of the roundish \(201\), \(301\), \(401\), \(501\), \(601\), \(701\), \(801\), \(901\), \(1001\), \(1101\), \(1201\), \(1301\), \(1401\), \(1501\), \(1601\), \(1801\), \(2001\), \(2101\), \(2201\), \(2401\), \(2501\), \(2601\), \(2701\), \(2801\), \(3001\), \(3201\), \(3301\), \(3501\), \(3601\), \(3901\), \(4001\).
The code to check for the FFTW criteria above and some of the matching numbers are listed below.
def factorize(n, a=[]):
    if n <= 1:
        return a
    for i in range(2, n + 1):
        if n % i == 0:
            return factorize(n // i, a + [i])

def good_size(n):
    factors = factorize(n - 1)
    return (all([f in [2, 3, 4, 5, 7, 11, 13] for f in factors])
            and factors.count(11) + factors.count(13) < 2
            and n % 2)

', '.join([str(a) for a in range(20, 4100) if good_size(a)])
\(21\), \(23\), \(25\), \(27\), \(29\), \(31\), \(33\), \(37\), \(41\), \(43\), \(45\), \(49\), \(51\), \(53\), \(55\), \(57\), \(61\), \(65\), \(67\), \(71\), \(73\), \(79\), \(81\), \(85\), \(89\), \(91\), \(97\), \(99\), \(101\), \(105\), \(109\), \(111\), \(113\), \(121\), \(127\), \(129\), \(131\), \(133\), \(141\), \(145\), \(151\), \(155\), \(157\), \(161\), \(163\), \(169\), \(177\), \(181\), \(183\), \(193\), \(197\), \(199\), \(201\), \(209\), \(211\), \(217\), \(221\), \(225\), \(235\), \(241\), \(251\), \(253\), \(257\), \(261\), \(265\), \(271\), \(281\), \(289\), \(295\), \(301\), \(309\), \(313\), \(321\), \(325\), \(331\), \(337\), \(351\), \(353\), \(361\), \(365\), \(379\), \(385\), \(391\), \(393\), \(397\), \(401\), \(417\), \(421\), \(433\), \(441\), \(449\), \(451\), \(463\), \(469\), \(481\), \(487\), \(491\), \(501\), \(505\), \(513\), \(521\), \(529\), \(541\), \(547\), \(551\), \(561\), \(577\), \(589\), \(595\), \(601\), \(617\), \(625\), \(631\), \(641\), \(649\), \(651\), \(661\), \(673\), \(687\), \(701\), \(703\), \(705\), \(721\), \(729\), \(751\), \(757\), \(769\), \(771\), \(781\), \(785\), \(793\), \(801\), \(811\), \(833\), \(841\), \(865\), \(881\), \(883\), \(897\), \(901\), \(911\), \(925\), \(937\), \(961\), \(973\), \(981\), \(991\), \(1001\), \(1009\), \(1025\), \(1041\), \(1051\), \(1057\), \(1079\), \(1081\), \(1093\), \(1101\), \(1121\), \(1135\), \(1153\), \(1171\), \(1177\), \(1189\), \(1201\), \(1233\), \(1249\), \(1251\), \(1261\), \(1275\), \(1281\), \(1297\), \(1301\), \(1321\), \(1345\), \(1351\), \(1373\), \(1387\), \(1401\), \(1405\), \(1409\), \(1441\), \(1457\), \(1459\), \(1471\), \(1501\), \(1513\), \(1537\), \(1541\), \(1561\), \(1569\), \(1585\), \(1601\), \(1621\), \(1639\), \(1651\), \(1665\), \(1681\), \(1729\), \(1751\), \(1761\), \(1765\), \(1783\), \(1793\), \(1801\), \(1821\), \(1849\), \(1873\), \(1891\), \(1921\), \(1945\), \(1951\), \(1961\), \(1981\), \(2001\), \(2017\), \(2049\), \(2059\), \(2081\), \(2101\), \(2107\), \(2113\), \(2157\), \(2161\), \(2185\), \(2201\), \(2241\), \(2251\), \(2269\), \(2305\), \(2311\), \(2341\), \(2353\), \(2377\), \(2401\), \(2431\), \(2451\), \(2465\), \(2497\), \(2501\), \(2521\), \(2549\), \(2561\), \(2593\), \(2601\), \(2641\), \(2647\), \(2689\), \(2701\), \(2731\), \(2745\), \(2751\), \(2773\), \(2801\), \(2809\), \(2817\), \(2881\), \(2913\), \(2917\), \(2941\), \(2971\), \(3001\), \(3025\), \(3073\), \(3081\), \(3121\), \(3137\), \(3151\), \(3169\), \(3201\), \(3235\), \(3241\), \(3251\), \(3277\), \(3301\), \(3329\), \(3361\), \(3403\), \(3431\), \(3457\), \(3501\), \(3511\), \(3521\), \(3529\), \(3565\), \(3585\), \(3601\), \(3641\), \(3697\), \(3745\), \(3751\), \(3781\), \(3823\), \(3841\), \(3851\), \(3889\), \(3901\), \(3921\), \(3961\), \(4001\), \(4033\), \(4051\), \(4097\)
Offset-coordinate separation¶
Float precision loss¶
When floating point numbers of different magnitudes get added up, there is an inherent precision loss that grows with the magnitude disparity.
If a particle has a large coordinate (think 5.223426), but moves for a small distance (think 7.139152e-4) due to low xi_step_size and small momentum projection, calculating the sum of these numbers suffers from precision loss due to the finite significand size.
An oversimplified illustration in decimal notation:
5.223426
+0.0007139152
=5.224139LOST
We have not conducted extensive research on how detrimental this round-off accumulation is to LCODE 3D numerical stability in \(\xi\). Currently the transverse noise dominates, but in order to make our implementation a bit more future-proof, we store the plasma particle coordinates separated into two floats, the initial position (x_init, y_init) and the accumulated offset (x_offt, y_offt), and do not mix them.
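The effect is easy to reproduce with single-precision floats (the exact digits depend on the rounding, so they are not spelled out here):

import numpy as np

x_init = np.float32(5.223426)
step = np.float32(7.139152e-4)
mixed = x_init + step        # the sum cannot keep all the digits of `step`
recovered = mixed - x_init
print(step, recovered)       # the recovered offset typically differs in the last digits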
Mixing them all the time…¶
OK, we do mix them. Each and every function involving them adds them up at some point and even has code like this:
x = x_init + x_offt
...
x_offt = x - x_init
to reconstruct x_offt from the ‘dirty’ sum values x.
We do that because we’re fine with singular round-off errors as long as they don’t propagate to the next step, accumulating over millions of \(\xi\)-steps (‘Test 1’ simulations were conducted for up to 1.5 million steps).
… but not where it really matters¶
This way the only place where the separation really has to be preserved is the path from prev.x_offt to new_state.x_offt. Several x_offt additions are performed and rolled back at each \(\xi\)-step, but only two kinds of them persist, both residing in move_smart():
- x_offt += px / (gamma_m - pz) * xi_step_size does no mixing with the coordinate values, and
- x = +2 * reflect_boundary - x and the similar one for the left boundary only happen during particle reflection, which presumably happens rarely and only affects particles that have already deviated at least several cells away from their initial position.
This way most particles won’t experience this kind of rounding issue with their coordinates. On the flip side, splitting the coordinates makes working with them quite unwieldy.
Design decisions¶
Codebase complexity¶
The code strives to be readable and vaguely understandable by a freshman student with only some basic background in plasma physics and numerical simulations, to the point of being comfortable with modifying it.
Given the complexity of the physical problem behind it, this goal is, sadly, unattainable, but the authors employ several avenues to get as close as possible:
- Abstaining from using advanced programming concepts. Cool things like aspect-oriented programming are neat, but keeping the code well under 1000 SLOC is even neater. The two classes we currently have are already two classes over the ideal amount.
- Preferring less code over extensibility.
- Picking simpler code over performance tweaking.
- Choosing malleability over user convenience.
- Creating external modules or branches over exhaustive featureset.
- Not shying away from external dependencies or unpopular technologies, even if this means sacrificing the portability.
- Appointing a physics student with modest programming background as the maintainer and primary code reviewer.
Codebase size¶
LCODE 3D wasn’t always around 500 SLOC. In fact, even as it got rewritten from C to Cython to numba to numba.cuda to cupy, it peaked at around 5000 SLOC, twice. And we don’t even count its Fortran days.
In order to objectively curb the complexity, scope and malleability of the codebase, its size is limited to 1000 SLOC.
David Wheeler’s SLOCCount is used for obtaining the metric. Empty lines, docstrings and comments don’t count towards that limit.
Zero special boundary treatment¶
[Not allowing the particles to reach the outer cells of the simulation window] slightly modifies the physical problem itself, but, in return blesses us with the ability to forego special boundary checks during deposition, interpolation and field calculation, simplifying the code and boosting the performance.
Memory considerations¶
LCODE 3D is observed to consume roughly the same amount of host and GPU RAM, hovering around 500 MiB for a 641x641 grid, coarseness=3 and fineness=2.
The size of the arrays processed by LCODE 3D depends on these parameters.
Let’s label the field/densities grid size in a single direction as \(N\), coarse plasma grid size as \(N_c \approx \frac{N}{\text{coarseness}}\) and fine plasma grid size as \(N_f \approx N * \text{fineness}\).
With about the same amount of arrays in scope for each of these three sizes,
it is clear that the \(N_f^2\)-sized arrays would dominate the memory consumption.
Fortunately, the arrays that contain fine plasma characteristics are transient and only used during the deposition, while the interpolation indices and coefficients grouped under virt_params can be reduced to 1D arrays by exploiting the \(x/y\) symmetry of the coarse/fine plasma grids.
This way LCODE 3D stores only \(N_c^2\)- and \(N^2\)-sized arrays, with the \(N_f\)-sized ones barely taking up any space thanks to being 1D.
Also, all previous attempts to micromanage the GPU memory allocations have been scrapped in favor of blindly trusting the cupy on-demand allocation. Not only is it extremely convenient, it’s even more performant than our own solutions.
Integer xi steps¶
\(\xi\)-steps are integer for the purpose of bypassing float precision-based errors. The task of converting it into the \(\xi\)-coordinate is placed within the usage context.
Contributing¶
When contributing to this repository, please discuss the change you wish to make via email (team@lcode.info), issue tracker (https://github.com/lotov/lcode3d/issues), personal communication or any other method with our team.
The suggested followup workflow for the implementor would be:
- choose the most suitable parent branch;
- fork https://github.com/lotov/lcode3d or its fork;
- check it out locally;
- install dependencies (see
requirements.txt
); - verify that LCODE runs as-is;
- implement, test and commit changes;
- check that the code is still under 1000 SLOC;
- try to strip all the complex programming concepts and clever hacks;
- rebase it if the parent branch advances;
- submit a pull request;
- wait for it to be rebased-and-merged.
By submitting patches to this project, you agree that they will be redistributed under the project’s license according to the normal forms and usages of the open-source community.
Useful links¶
Internal:
- Source code listing interlinked with documentation
- Example configuration file interlinked with documentation
- Index of LCODE functions and config parameters
External:
- LCODE 3D source: https://github.com/lotov/lcode3d
- LCODE 3D documentation: https://lcode3d.readthedocs.org
- LCODE team website: https://lcode.info
- LCODE team email: team@lcode.info