Dask¶
Dask is a flexible library for parallel computing in Python.
Dask is composed of two parts:
 Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
 “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to largerthanmemory or distributed environments. These parallel collections run on top of dynamic task schedulers.
Dask emphasizes the following virtues:
 Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
 Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
 Native: Enables distributed computing in pure Python with access to the PyData stack.
 Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
 Scales up: Runs resiliently on clusters with 1000s of cores
 Scales down: Trivial to set up and run on a laptop in a single process
 Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans
See the dask.distributed documentation (separate website) for more technical information on Dask’s distributed scheduler.
Familiar user interface¶
Dask DataFrame mimics Pandas  documentation
import pandas as pd import dask.dataframe as dd
df = pd.read_csv('20150101.csv') df = dd.read_csv('2015**.csv')
df.groupby(df.user_id).value.mean() df.groupby(df.user_id).value.mean().compute()
Dask Array mimics NumPy  documentation
import numpy as np import dask.array as da
f = h5py.File('myfile.hdf5') f = h5py.File('myfile.hdf5')
x = np.array(f['/smalldata']) x = da.from_array(f['/bigdata'],
chunks=(1000, 1000))
x  x.mean(axis=1) x  x.mean(axis=1).compute()
Dask Bag mimics iterators, Toolz, and PySpark  documentation
import dask.bag as db
b = db.read_text('2015**.json.gz').map(json.loads)
b.pluck('name').frequencies().topk(10, lambda pair: pair[1]).compute()
Dask Delayed mimics for loops and wraps custom code  documentation
from dask import delayed
L = []
for fn in filenames: # Use for loops to build up computation
data = delayed(load)(fn) # Delay execution of function
L.append(delayed(process)(data)) # Build connections between variables
result = delayed(summarize)(L)
result.compute()
The concurrent.futures interface provides general submission of custom tasks:  documentation
from dask.distributed import Client
client = Client('scheduler:port')
futures = []
for fn in filenames:
future = client.submit(load, fn)
futures.append(future)
summary = client.submit(summarize, futures)
summary.result()
Scales from laptops to clusters¶
Dask is convenient on a laptop. It installs trivially with
conda
or pip
and extends the size of convenient datasets from “fits in
memory” to “fits on disk”.
Dask can scale to a cluster of 100s of machines. It is resilient, elastic, data local, and low latency. For more information, see the documentation about the distributed scheduler.
This ease of transition between singlemachine to moderate cluster enables users to both start simple and grow when necessary.
Complex Algorithms¶
Dask represents parallel computations with task graphs. These
directed acyclic graphs may have arbitrary structure, which enables both
developers and users the freedom to build sophisticated algorithms and to
handle messy situations not easily managed by the map/filter/groupby
paradigm common in most data engineering frameworks.
We originally needed this complexity to build complex algorithms for ndimensional arrays but have found it to be equally valuable when dealing with messy situations in everyday problems.
Install Dask¶
You can install dask with conda
, with pip
, or by installing from source.
Conda¶
Dask is installed by default in Anaconda. You can update Dask using the conda command:
conda install dask
This installs Dask and all common dependencies, including Pandas and NumPy. Dask packages are maintained both on the default channel and on condaforge. Optionally, you can obtain a minimal Dask installation using the following command:
conda install daskcore
This will install a minimal set of dependencies required to run Dask similar to (but not exactly the same as) pip install dask
below.
Pip¶
You can install everything required for most common uses of Dask (arrays, dataframes, …) This installs both Dask and dependencies like NumPy, Pandas, and so on that are necessary for different workloads. This is often the right choice for Dask users:
pip install "dask[complete]" # Install everything
You can also install only the Dask library. Modules like dask.array
,
dask.dataframe
, dask.delayed
, or dask.distributed
won’t work until you also install NumPy,
Pandas, Toolz, or Tornado, respectively. This is common for downstream library
maintainers:
pip install dask # Install only core parts of dask
We also maintain other dependency sets for different subsets of functionality:
pip install "dask[array]" # Install requirements for dask array
pip install "dask[bag]" # Install requirements for dask bag
pip install "dask[dataframe]" # Install requirements for dask dataframe
pip install "dask[delayed]" # Install requirements for dask delayed
pip install "dask[distributed]" # Install requirements for distributed dask
We have these options so that users of the lightweight core Dask scheduler aren’t required to download the more exotic dependencies of the collections (Numpy, Pandas, Tornado, etc.).
Install from Source¶
To install Dask from source, clone the repository from github:
git clone https://github.com/dask/dask.git
cd dask
pip install .
You can also install all dependencies as well:
pip install ".[complete]"
You can view the list of all dependencies within the extras_require
field
of setup.py
.
Or do a developer install by using the e
flag:
pip install e .
Anaconda¶
Dask is included by default in the Anaconda distribution.
Optional dependencies¶
Specific functionality in Dask may require additional optional dependencies.
For example, reading from Amazon S3 requires s3fs
.
These optional dependencies and their minimum supported versions are listed below.
Dependency  Version  Description 

bokeh  >=1.0.0  Visualizing dask diagnostics 
cloudpickle  >=0.2.1  Pickling support for Python objects 
cityhash  Faster hashing of arrays  
distributed  >=2.0  Distributed computing in Python 
fastparquet  Storing and reading data from parquet files  
fsspec  >=0.6.0  Used for local, cluster and remote data IO 
gcsfs  >=0.4.0  Filesystem interface to Google Cloud Storage 
murmurhash  Faster hashing of arrays  
numpy  >=1.13.0  Required for dask.array 
pandas  >=0.21.0  Required for dask.dataframe 
partd  >=0.3.10  Concurrent appendable keyvalue storage 
psutil  Enables a more accurate CPU count  
pyarrow  >=0.14.0  Python library for Apache Arrow 
s3fs  >=0.4.0  Reading from Amazon S3 
sqlalchemy  Writing and reading from SQL databases  
toolz  >=0.7.3  Utility functions for iterators, functions, and dictionaries 
xxhash  Faster hashing of arrays 
Test¶
Test Dask with py.test
:
cd dask
py.test dask
Please be aware that installing Dask naively may not install all
requirements by default. Please read the pip
section above which discusses
requirements. You may choose to install the dask[complete]
version which includes
all dependencies for all collections. Alternatively, you may choose to test
only certain submodules depending on the libraries within your environment.
For example, to test only Dask core and Dask array we would run tests as
follows:
py.test dask/tests dask/array/tests
Setup¶
This page describes various ways to set up Dask on different hardware, either locally on your own machine or on a distributed cluster. If you are just getting started, then this page is unnecessary. Dask does not require any setup if you only want to use it on a single computer.
Dask has two families of task schedulers:
 Single machine scheduler: This scheduler provides basic features on a local process or thread pool. This scheduler was made first and is the default. It is simple and cheap to use. It can only be used on a single machine and does not scale.
 Distributed scheduler: This scheduler is more sophisticated. It offers more features, but also requires a bit more effort to set up. It can run locally or distributed across a cluster.
If you import Dask, set up a computation, and then call compute
, then you
will use the singlemachine scheduler by default. To use the dask.distributed
scheduler you must set up a Client
import dask.dataframe as dd
df = dd.read_csv(...)
df.x.sum().compute() # This uses the singlemachine scheduler by default
from dask.distributed import Client
client = Client(...) # Connect to distributed cluster and override default
df.x.sum().compute() # This now runs on the distributed system
Note that the newer dask.distributed
scheduler is often preferable, even on
single workstations. It contains many diagnostics and features not found in
the older singlemachine scheduler. The following pages explain in more detail
how to set up Dask on a variety of local and distributed hardware.
 Single Machine:
 Default Scheduler: The nosetup default. Uses local threads or processes for largerthanmemory processing
 dask.distributed: The sophistication of the newer system on a single machine. This provides more advanced features while still requiring almost no setup.
 Distributed computing:
 Manual Setup: The command line interface to set up
daskscheduler
anddaskworker
processes. Useful for IT or anyone building a deployment solution.  SSH: Use SSH to set up Dask across an unmanaged cluster.
 High Performance Computers: How to run Dask on traditional HPC environments using tools like MPI, or job schedulers like SLURM, SGE, TORQUE, LSF, and so on.
 Kubernetes: Deploy Dask with the popular Kubernetes resource manager using either Helm or a native deployment.
 YARN / Hadoop: Deploy Dask on YARN clusters, such as are found in traditional Hadoop installations.
 Python API (advanced): Create
Scheduler
andWorker
objects from Python as part of a distributed Tornado TCP application. This page is useful for those building custom frameworks.  Docker containers are available and may be useful in some of the solutions above.
 Cloud for current recommendations on how to deploy Dask and Jupyter on common cloud providers like Amazon, Google, or Microsoft Azure.
 Manual Setup: The command line interface to set up
SingleMachine Scheduler¶
The default Dask scheduler provides parallelism on a single machine by using either threads or processes. It is the default choice used by Dask because it requires no setup. You don’t need to make any choices or set anything up to use this scheduler. However, you do have a choice between threads and processes:
Threads: Use multiple threads in the same process. This option is good for numeric code that releases the GIL (like NumPy, Pandas, ScikitLearn, Numba, …) because data is free to share. This is the default scheduler for
dask.array
,dask.dataframe
, anddask.delayed
Processes: Send data to separate processes for processing. This option is good when operating on pure Python objects like strings or JSONlike dictionary data that holds onto the GIL, but not very good when operating on numeric data like Pandas DataFrames or NumPy arrays. Using processes avoids GIL issues, but can also result in a lot of interprocess communication, which can be slow. This is the default scheduler for
dask.bag
, and it is sometimes useful withdask.dataframe
Note that the
dask.distributed
scheduler is often a better choice when working with GILbound code. See dask.distributed on a single machineSinglethreaded: Execute computations in a single thread. This option provides no parallelism, but is useful when debugging or profiling. Turning your parallel execution into a sequential one can be a convenient option in many situations where you want to better understand what is going on
Selecting Threads, Processes, or Single Threaded¶
You can select between these options by specifying one of the following three
values to the scheduler=
keyword:
"threads"
: Uses a ThreadPool in the local process"processes"
: Uses a ProcessPool to spread work between processes"singlethreaded"
: Uses a forloop in the current thread
You can specify these options in any of the following ways:
When calling
.compute()
x.compute(scheduler='threads')
With a context manager
with dask.config.set(scheduler='threads'): x.compute() y.compute()
As a global setting
dask.config.set(scheduler='threads')
Single Machine: dask.distributed¶
The dask.distributed
scheduler works well on a single machine. It is sometimes
preferred over the default scheduler for the following reasons:
 It provides access to asynchronous API, notably Futures
 It provides a diagnostic dashboard that can provide valuable insight on performance and progress
 It handles data locality with more sophistication, and so can be more efficient than the multiprocessing scheduler on workloads that require multiple processes
You can create a dask.distributed
scheduler by importing and creating a
Client
with no arguments. This overrides whatever default was previously
set.
from dask.distributed import Client
client = Client()
You can navigate to http://localhost:8787/status to see the diagnostic dashboard if you have Bokeh installed.
Client¶
You can trivially set up a local cluster on your machine by instantiating a Dask Client with no arguments
from dask.distributed import Client
client = Client()
This sets up a scheduler in your local process and several processes running singlethreaded Workers.
If you want to run workers in your same process, you can pass the
processes=False
keyword argument.
client = Client(processes=False)
This is sometimes preferable if you want to avoid interworker communication and your computations release the GIL. This is common when primarily using NumPy or Dask Array.
LocalCluster¶
The Client()
call described above is shorthand for creating a LocalCluster
and then passing that to your client.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
This is equivalent, but somewhat more explicit. You may want to look at the
keyword arguments available on LocalCluster
to understand the options available
to you on handling the mixture of threads and processes, like specifying explicit
ports, and so on.

class
distributed.deploy.local.
LocalCluster
(n_workers=None, threads_per_worker=None, processes=True, loop=None, start=None, host=None, ip=None, scheduler_port=0, silence_logs=30, dashboard_address=':8787', worker_dashboard_address=None, diagnostics_port=None, services=None, worker_services=None, service_kwargs=None, asynchronous=False, security=None, protocol=None, blocked_handlers=None, interface=None, worker_class=None, **worker_kwargs)¶ Create local Scheduler and Workers
This creates a “cluster” of a scheduler and workers running on the local machine.
Parameters:  n_workers: int
Number of workers to start
 processes: bool
Whether to use processes (True) or threads (False). Defaults to True
 threads_per_worker: int
Number of threads per each worker
 scheduler_port: int
Port of the scheduler. 8786 by default, use 0 to choose a random port
 silence_logs: logging level
Level of logs to print out to stdout.
logging.WARN
by default. Use a falsey value like False or None for no change. host: string
Host address on which the scheduler will listen, defaults to only localhost
 ip: string
Deprecated. See
host
above. dashboard_address: str
Address on which to listen for the Bokeh diagnostics server like ‘localhost:8787’ or ‘0.0.0.0:8787’. Defaults to ‘:8787’. Set to
None
to disable the dashboard. Use ‘:0’ for a random port. diagnostics_port: int
Deprecated. See dashboard_address.
 asynchronous: bool (False by default)
Set to True if using this cluster within async/await functions or within Tornado gen.coroutines. This should remain False for normal use.
 worker_kwargs: dict
Extra worker arguments, will be passed to the Worker constructor.
 blocked_handlers: List[str]
A list of strings specifying a blacklist of handlers to disallow on the Scheduler, like
['feed', 'run_function']
 service_kwargs: Dict[str, Dict]
Extra keywords to hand to the running services
 security : Security or bool, optional
Configures communication security in this cluster. Can be a security object, or True. If True, temporary selfsigned credentials will be created automatically.
 protocol: str (optional)
Protocol to use like
tcp://
,tls://
,inproc://
This defaults to sensible choice given other keyword arguments likeprocesses
andsecurity
 interface: str (optional)
Network interface to use. Defaults to lo/localhost
 worker_class: Worker
Worker class used to instantiate workers from.
Examples
>>> cluster = LocalCluster() # Create a local cluster with as many workers as cores # doctest: +SKIP >>> cluster # doctest: +SKIP LocalCluster("127.0.0.1:8786", workers=8, threads=8)
>>> c = Client(cluster) # connect to local cluster # doctest: +SKIP
Scale the cluster to three workers
>>> cluster.scale(3) # doctest: +SKIP
Pass extra keyword arguments to Bokeh
>>> LocalCluster(service_kwargs={'dashboard': {'prefix': '/foo'}}) # doctest: +SKIP
Command Line¶
This is the most fundamental way to deploy Dask on multiple machines. In production environments, this process is often automated by some other resource manager. Hence, it is rare that people need to follow these instructions explicitly. Instead, these instructions are useful for IT professionals who may want to set up automated services to deploy Dask within their institution.
A dask.distributed
network consists of one daskscheduler
process and
several daskworker
processes that connect to that scheduler. These are
normal Python processes that can be executed from the command line. We launch
the daskscheduler
executable in one process and the daskworker
executable in several processes, possibly on different machines.
To accomplish this, launch daskscheduler
on one node:
$ daskscheduler
Scheduler at: tcp://192.0.0.100:8786
Then, launch daskworker
on the rest of the nodes, providing the address to
the node that hosts daskscheduler
:
$ daskworker tcp://192.0.0.100:8786
Start worker at: tcp://192.0.0.1:12345
Registered to: tcp://192.0.0.100:8786
$ daskworker tcp://192.0.0.100:8786
Start worker at: tcp://192.0.0.2:40483
Registered to: tcp://192.0.0.100:8786
$ daskworker tcp://192.0.0.100:8786
Start worker at: tcp://192.0.0.3:27372
Registered to: tcp://192.0.0.100:8786
The workers connect to the scheduler, which then sets up a longrunning network connection back to the worker. The workers will learn the location of other workers from the scheduler.
Handling Ports¶
The scheduler and workers both need to accept TCP connections on an open port.
By default, the scheduler binds to port 8786
and the worker binds to a
random open port. If you are behind a firewall then you may have to open
particular ports or tell Dask to listen on particular ports with the port
and workerport
keywords.:
daskscheduler port 8000
daskworker dashboardaddress 8000 nannyport 8001
Nanny Processes¶
Dask workers are run within a nanny process that monitors the worker process and restarts it if necessary.
Diagnostic Web Servers¶
Additionally, Dask schedulers and workers host interactive diagnostic web
servers using Bokeh. These are optional, but
generally useful to users. The diagnostic server on the scheduler is
particularly valuable, and is served on port 8787
by default (configurable
with the dashboardaddress
keyword).
For more information about relevant ports, please take a look at the available command line options.
Automated Tools¶
There are various mechanisms to deploy these executables on a cluster, ranging from manually SSHing into all of the machines to more automated systems like SGE/SLURM/Torque or Yarn/Mesos. Additionally, cluster SSH tools exist to send the same commands to many machines. We recommend searching online for “cluster ssh” or “cssh”.
CLI Options¶
Note
The command line documentation here may differ depending on your installed
version. We recommend referring to the output of daskscheduler help
and daskworker help
.
daskscheduler¶
daskscheduler [OPTIONS] [PRELOAD_ARGV]...
Options

host
<host>
¶ URI, IP or hostname of this server

port
<port>
¶ Serving port

interface
<interface>
¶ Preferred network interface like ‘eth0’ or ‘ib0’

protocol
<protocol>
¶ Protocol like tcp, tls, or ucx

tlscafile
<tls_ca_file>
¶ CA cert(s) file for TLS (in PEM format)

tlscert
<tls_cert>
¶ certificate file for TLS (in PEM format)

tlskey
<tls_key>
¶ private key file for TLS (in PEM format)

bokehport
<bokeh_port>
¶ Deprecated. See –dashboardaddress

dashboardaddress
<dashboard_address>
¶ Address on which to listen for diagnostics dashboard [default: :8787]

dashboard
,
nodashboard
¶
Launch the Dashboard [default: –dashboard]

bokeh
,
nobokeh
¶
Deprecated. See –dashboard/–nodashboard.

show
,
noshow
¶
Show web UI [default: –show]

dashboardprefix
<dashboard_prefix>
¶ Prefix for the dashboard app

usexheaders
<use_xheaders>
¶ User xheaders in dashboard app for ssl termination in header [default: False]

pidfile
<pid_file>
¶ File to write the process PID

schedulerfile
<scheduler_file>
¶ File to write connection information. This may be a good way to share connection information if your cluster is on a shared network file system.

localdirectory
<local_directory>
¶ Directory to place scheduler files

preload
<preload>
¶ Module that should be loaded by the scheduler process like “foo.bar” or “/path/to/foo.py”.

idletimeout
<idle_timeout>
¶ Time of inactivity after which to kill the scheduler

version
¶
Show the version and exit.
Arguments

PRELOAD_ARGV
¶
Optional argument(s)
daskworker¶
daskworker [OPTIONS] [SCHEDULER] [PRELOAD_ARGV]...
Options

tlscafile
<tls_ca_file>
¶ CA cert(s) file for TLS (in PEM format)

tlscert
<tls_cert>
¶ certificate file for TLS (in PEM format)

tlskey
<tls_key>
¶ private key file for TLS (in PEM format)

workerport
<worker_port>
¶ Serving computation port, defaults to random

nannyport
<nanny_port>
¶ Serving nanny port, defaults to random

bokehport
<bokeh_port>
¶ Deprecated. See –dashboardaddress

dashboardaddress
<dashboard_address>
¶ Address on which to listen for diagnostics dashboard

dashboard
,
nodashboard
¶
Launch the Dashboard [default: –dashboard]

bokeh
,
nobokeh
¶
Deprecated. See –dashboard/–nodashboard.

listenaddress
<listen_address>
¶ The address to which the worker binds. Example: tcp://0.0.0.0:9000

contactaddress
<contact_address>
¶ The address the worker advertises to the scheduler for communication with it and other workers. Example: tcp://127.0.0.1:9000

host
<host>
¶ Serving host. Should be an ip address that is visible to the scheduler and other workers. See –listenaddress and –contactaddress if you need different listen and contact addresses. See –interface.

interface
<interface>
¶ Network interface like ‘eth0’ or ‘ib0’

protocol
<protocol>
¶ Protocol like tcp, tls, or ucx

nthreads
<nthreads>
¶ Number of threads per process.

nprocs
<nprocs>
¶ Number of worker processes to launch. [default: 1]

name
<name>
¶ A unique name for this worker like ‘worker1’. If used with –nprocs then the process number will be appended like name0, name1, name2, …

memorylimit
<memory_limit>
¶ Bytes of memory per process that the worker can use. This can be an integer (bytes), float (fraction of total system memory), string (like 5GB or 5000M), ‘auto’, or zero for no memory management [default: auto]

reconnect
,
noreconnect
¶
Reconnect to scheduler if disconnected [default: –reconnect]

nanny
,
nonanny
¶
Start workers in nanny process for management [default: –nanny]

pidfile
<pid_file>
¶ File to write the process PID

localdirectory
<local_directory>
¶ Directory to place worker files

resources
<resources>
¶ Resources for task constraints like “GPU=2 MEM=10e9”. Resources are applied separately to each worker process (only relevant when starting multiple worker processes with ‘–nprocs’).

schedulerfile
<scheduler_file>
¶ Filename to JSON encoded scheduler information. Use with daskscheduler –schedulerfile

deathtimeout
<death_timeout>
¶ Seconds to wait for a scheduler before closing

dashboardprefix
<dashboard_prefix>
¶ Prefix for the dashboard

lifetime
<lifetime>
¶ If provided, shut down the worker after this duration.

lifetimestagger
<lifetime_stagger>
¶ Random amount by which to stagger lifetime values [default: 0 seconds]

lifetimerestart
,
nolifetimerestart
¶
Whether or not to restart the worker after the lifetime lapses. This assumes that you are using the –lifetime and –nanny keywords [default: False]

preload
<preload>
¶ Module that should be loaded by each worker process like “foo.bar” or “/path/to/foo.py”

version
¶
Show the version and exit.
Arguments

SCHEDULER
¶
Optional argument

PRELOAD_ARGV
¶
Optional argument(s)
SSH¶
It is easy to set up Dask on informally managed networks of machines using SSH.
This can be done manually using SSH and the
Dask command line interface,
or automatically using either the SSHCluster
Python command or the
daskssh
command line tool. This document describes both of these options.
Python Interface¶

distributed.deploy.ssh.
SSHCluster
(hosts: List[str] = None, connect_options: dict = {}, worker_options: dict = {}, scheduler_options: dict = {}, worker_module: str = 'distributed.cli.dask_worker', **kwargs)¶ Deploy a Dask cluster using SSH
The SSHCluster function deploys a Dask Scheduler and Workers for you on a set of machine addresses that you provide. The first address will be used for the scheduler while the rest will be used for the workers (feel free to repeat the first hostname if you want to have the scheduler and worker cohabitate one machine.)
You may configure the scheduler and workers by passing
scheduler_options
andworker_options
dictionary keywords. See thedask.distributed.Scheduler
anddask.distributed.Worker
classes for details on the available options, but the defaults should work in most situations.You may configure your use of SSH itself using the
connect_options
keyword, which passes values to theasyncssh.connect
function. For more information on these see the documentation for theasyncssh
library https://asyncssh.readthedocs.io .Parameters:  hosts: List[str]
List of hostnames or addresses on which to launch our cluster The first will be used for the scheduler and the rest for workers
 connect_options:
Keywords to pass through to asyncssh.connect known_hosts: List[str] or None
The list of keys which will be used to validate the server host key presented during the SSH handshake. If this is not specified, the keys will be looked up in the file .ssh/known_hosts. If this is explicitly set to None, server host key validation will be disabled.
 worker_options:
Keywords to pass on to daskworker
 scheduler_options:
Keywords to pass on to daskscheduler
 worker_module:
Python module to call to start the worker
See also
dask.distributed.Scheduler
,dask.distributed.Worker
,asyncssh.connect
Examples
>>> from dask.distributed import Client, SSHCluster >>> cluster = SSHCluster( ... ["localhost", "localhost", "localhost", "localhost"], ... connect_options={"known_hosts": None}, ... worker_options={"nthreads": 2}, ... scheduler_options={"port": 0, "dashboard_address": ":8797"} ... ) >>> client = Client(cluster)
An example using a different worker module, in particular the
daskcudaworker
command from thedaskcuda
project.>>> from dask.distributed import Client, SSHCluster >>> cluster = SSHCluster( ... ["localhost", "hostwithgpus", "anothergpuhost"], ... connect_options={"known_hosts": None}, ... scheduler_options={"port": 0, "dashboard_address": ":8797"}, ... worker_module='dask_cuda.dask_cuda_worker') >>> client = Client(cluster)
Command Line¶
The convenience script daskssh
opens several SSH connections to your
target computers and initializes the network accordingly. You can
give it a list of hostnames or IP addresses:
$ daskssh 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4
Or you can use normal UNIX grouping:
$ daskssh 192.168.0.{1,2,3,4}
Or you can specify a hostfile that includes a list of hosts:
$ cat hostfile.txt
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
$ daskssh hostfile hostfile.txt
The daskssh
utility depends on the paramiko
:
pip install paramiko
Note
The command line documentation here may differ depending on your installed
version. We recommend referring to the output of daskssh help
.
daskssh¶
Launch a distributed cluster over SSH. A ‘daskscheduler’ process will run on the first host specified in [HOSTNAMES] or in the hostfile (unless –scheduler is specified explicitly). One or more ‘daskworker’ processes will be run each host in [HOSTNAMES] or in the hostfile. Use command line flags to adjust how many daskworker process are run on each host (–nprocs) and how many cpus are used by each daskworker process (–nthreads).
daskssh [OPTIONS] [HOSTNAMES]...
Options

scheduler
<scheduler>
¶ Specify scheduler node. Defaults to first address.

schedulerport
<scheduler_port>
¶ Specify scheduler port number. [default: 8786]

nthreads
<nthreads>
¶ Number of threads per worker process. Defaults to number of cores divided by the number of processes per host.

nprocs
<nprocs>
¶ Number of worker processes per host. [default: 1]

hostfile
<hostfile>
¶ Textfile with hostnames/IP addresses

sshusername
<ssh_username>
¶ Username to use when establishing SSH connections.

sshport
<ssh_port>
¶ Port to use for SSH connections. [default: 22]

sshprivatekey
<ssh_private_key>
¶ Private key file to use for SSH connections.

nohost
¶
Do not pass the hostname to the worker.

logdirectory
<log_directory>
¶ Directory to use on all cluster nodes for the output of daskscheduler and daskworker commands.

remotepython
<remote_python>
¶ Path to Python on remote nodes.

memorylimit
<memory_limit>
¶ Bytes of memory that the worker can use. This can be an integer (bytes), float (fraction of total system memory), string (like 5GB or 5000M), ‘auto’, or zero for no memory management [default: auto]

workerport
<worker_port>
¶ Serving computation port, defaults to random

nannyport
<nanny_port>
¶ Serving nanny port, defaults to random

remotedaskworker
<remote_dask_worker>
¶ Worker to run. [default: distributed.cli.dask_worker]

version
¶
Show the version and exit.
Arguments

HOSTNAMES
¶
Optional argument(s)
High Performance Computers¶
Relevant Machines¶
This page includes instructions and guidelines when deploying Dask on high performance supercomputers commonly found in scientific and industry research labs. These systems commonly have the following attributes:
 Some mechanism to launch MPI applications or use job schedulers like SLURM, SGE, TORQUE, LSF, DRMAA, PBS, or others
 A shared network file system visible to all machines in the cluster
 A high performance network interconnect, such as Infiniband
 Little or no nodelocal storage
Where to start¶
Most of this page documents various ways and best practices to use Dask on an HPC cluster. This is technical and aimed both at users with some experience deploying Dask and also system administrators.
The preferred and simplest way to run Dask on HPC systems today both for new, experienced users or administrator is to use daskjobqueue.
However, daskjobqueue is slightly oriented toward interactive analysis usage, and it might be better to use tools like daskmpi in some routine batch production workloads.
Daskjobqueue and Daskdrmaa¶
The following projects provide easy highlevel access to Dask using resource managers that are commonly deployed on HPC systems:
 daskjobqueue for use with PBS, SLURM, LSF, SGE and other resource managers
 daskdrmaa for use with any DRMAA compliant resource manager
They provide interfaces that look like the following:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
memory="100GB",
project='P48500028',
queue='premium',
interface='ib0',
walltime='02:00:00')
cluster.scale(100) # Start 100 workers in 100 jobs that match the description above
from dask.distributed import Client
client = Client(cluster) # Connect to that cluster
Daskjobqueue provides a lot of possibilities like adaptive dynamic scaling of workers, we recommend reading the daskjobqueue documentation first to get a basic system running and then returning to this documentation for finetuning if necessary.
Using MPI¶
Note
This section may not be necessary if you use a tool like daskjobqueue.
You can launch a Dask network using mpirun
or mpiexec
and the
daskmpi
command line executable.
mpirun np 4 daskmpi schedulerfile /home/$USER/scheduler.json
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
This depends on the mpi4py library. It only
uses MPI to start the Dask cluster and not for internode communication. MPI
implementations differ: the use of mpirun np 4
is specific to the
mpich
or openmpi
MPI implementation installed through conda and linked
to mpi4py.
conda install mpi4py
It is not necessary to use exactly this implementation, but you may want to
verify that your mpi4py
Python library is linked against the proper
mpirun/mpiexec
executable and that the flags used (like np 4
) are
correct for your system. The system administrator of your cluster should be
very familiar with these concerns and able to help.
In some setups, MPI processes are not allowed to fork other processes. In this
case, we recommend using nonanny
option in order to prevent dask from
using an additional nanny process to manage workers.
Run daskmpi help
to see more options for the daskmpi
command.
High Performance Network¶
Many HPC systems have both standard Ethernet networks as well as
highperformance networks capable of increased bandwidth. You can instruct
Dask to use the highperformance network interface by using the interface
keyword with the daskworker
, daskscheduler
, or daskmpi
commands or
the interface=
keyword with the daskjobqueue Cluster
objects:
mpirun np 4 daskmpi schedulerfile /home/$USER/scheduler.json interface ib0
In the code example above, we have assumed that your cluster has an Infiniband
network interface called ib0
. You can check this by asking your system
administrator or by inspecting the output of ifconfig
$ ifconfig
lo Link encap:Local Loopback # Localhost
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
eth0 Link encap:Ethernet HWaddr XX:XX:XX:XX:XX:XX # Ethernet
inet addr:192.168.0.101
...
ib0 Link encap:Infiniband # Fast InfiniBand
inet addr:172.42.0.101
https://stackoverflow.com/questions/43881157/howdoiuseaninfinibandnetworkwithdask
Local Storage¶
Users often exceed memory limits available to a specific Dask deployment. In normal operation, Dask spills excess data to disk, often to the default temporary directory.
However, in HPC systems this default temporary directory may point to an network file system (NFS) mount which can cause problems as Dask tries to read and write many small files. Beware, reading and writing many tiny files from many distributed processes is a good way to shut down a national supercomputer.
If available, it’s good practice to point Dask workers to local storage, or
hard drives that are physically on each node. Your IT administrators will be
able to point you to these locations. You can do this with the
localdirectory
or local_directory=
keyword in the daskworker
command:
daskmpi ... localdirectory /path/to/local/storage
or any of the other Dask Setup utilities, or by specifying the following configuration value:
temporarydirectory: /path/to/local/storage
However, not all HPC systems have local storage. If this is the case then you
may want to turn off Dask’s ability to spill to disk altogether. See this
page
for more information on Dask’s memory policies. Consider changing the
following values in your ~/.config/dask/distributed.yaml
file to disable
spilling data to disk:
distributed:
worker:
memory:
target: false # don't spill to disk
spill: false # don't spill to disk
pause: 0.80 # pause execution at 80% memory use
terminate: 0.95 # restart the worker at 95% use
This stops Dask workers from spilling to disk, and instead relies entirely on mechanisms to stop them from processing when they reach memory limits.
As a reminder, you can set the memory limit for a worker using the
memorylimit
keyword:
daskmpi ... memorylimit 10GB
Launch Many Small Jobs¶
Note
This section is not necessary if you use a tool like daskjobqueue.
HPC job schedulers are optimized for large monolithic jobs with many nodes that all need to run as a group at the same time. Dask jobs can be quite a bit more flexible: workers can come and go without strongly affecting the job. If we split our job into many smaller jobs, we can often get through the job scheduling queue much more quickly than a typical job. This is particularly valuable when we want to get started right away and interact with a Jupyter notebook session rather than waiting for hours for a suitable allocation block to become free.
So, to get a large cluster quickly, we recommend allocating a daskscheduler process on one node with a modest wall time (the intended time of your session) and then allocating many small singlenode daskworker jobs with shorter wall times (perhaps 30 minutes) that can easily squeeze into extra space in the job scheduler. As you need more computation, you can add more of these singlenode jobs or let them expire.
Use Dask to colaunch a Jupyter server¶
Dask can help you by launching other services alongside it. For example, you
can run a Jupyter notebook server on the machine running the daskscheduler
process with the following commands
from dask.distributed import Client
client = Client(scheduler_file='scheduler.json')
import socket
host = client.run_on_scheduler(socket.gethostname)
def start_jlab(dask_scheduler):
import subprocess
proc = subprocess.Popen(['/path/to/jupyter', 'lab', 'ip', host, 'nobrowser'])
dask_scheduler.jlab_proc = proc
client.run_on_scheduler(start_jlab)
Kubernetes¶
Kubernetes and Helm¶
It is easy to launch a Dask cluster and a Jupyter notebook server on cloud resources using Kubernetes and Helm.
This is particularly useful when you want to deploy a fresh Python environment on Cloud services like Amazon Web Services, Google Compute Engine, or Microsoft Azure.
If you already have Python environments running in a preexisting Kubernetes cluster, then you may prefer the Kubernetes native documentation, which is a bit lighter weight.
Launch Kubernetes Cluster¶
This document assumes that you have a Kubernetes cluster and Helm installed.
If this is not the case, then you might consider setting up a Kubernetes cluster on one of the common cloud providers like Google, Amazon, or Microsoft. We recommend the first part of the documentation in the guide Zero to JupyterHub that focuses on Kubernetes and Helm (you do not need to follow all of these instructions). Also, JupyterHub is not necessary to deploy Dask:
Alternatively, you may want to experiment with Kubernetes locally using Minikube.
Helm Install Dask¶
Dask maintains a Helm chart repository containing various charts for the Dask community https://helm.dask.org/ . You will need to add this to your known channels and update your local charts:
helm repo add dask https://helm.dask.org/
helm repo update
Now, you can launch Dask on your Kubernetes cluster using the Dask Helm chart:
helm install dask/dask
This deploys a daskscheduler
, several daskworker
processes, and
also an optional Jupyter server.
Verify Deployment¶
This might take a minute to deploy. You can check its status with
kubectl
:
kubectl get pods
kubectl get services
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
baldeeljupyter924045334twtxd 0/1 ContainerCreating 0 1m
baldeelscheduler3074430035cn1dt 1/1 Running 0 1m
baldeelworker3032746726202jt 1/1 Running 0 1m
baldeelworker3032746726b8nqq 1/1 Running 0 1m
baldeelworker3032746726d0chx 0/1 ContainerCreating 0 1m
$ kubectl get services
NAME TYPE CLUSTERIP EXTERNALIP PORT(S) AGE
baldeeljupyter LoadBalancer 10.11.247.201 35.226.183.149 80:30173/TCP 2m
baldeelscheduler LoadBalancer 10.11.245.241 35.202.201.129 8786:31166/TCP,80:31626/TCP 2m
kubernetes ClusterIP 10.11.240.1 <none> 443/TCP
48m
You can use the addresses under EXTERNALIP
to connect to your nowrunning
Jupyter and Dask systems.
Notice the name baldeel
. This is the name that Helm has given to your
particular deployment of Dask. You could, for example, have multiple
DaskandJupyter clusters running at once, and each would be given a different
name. Note that you will need to use this name to refer to your deployment in the future.
Additionally, you can list all active helm deployments with:
helm list
NAME REVISION UPDATED STATUS CHART NAMESPACE
baldeel 1 Wed Dec 6 11:19:54 2017 DEPLOYED dask0.1.0 default
Connect to Dask and Jupyter¶
When we ran kubectl get services
, we saw some externally visible IPs:
mrocklin@pangeo181919:~$ kubectl get services
NAME TYPE CLUSTERIP EXTERNALIP PORT(S) AGE
baldeeljupyter LoadBalancer 10.11.247.201 35.226.183.149 80:30173/TCP 2m
baldeelscheduler LoadBalancer 10.11.245.241 35.202.201.129 8786:31166/TCP,80:31626/TCP 2m
kubernetes ClusterIP 10.11.240.1 <none> 443/TCP 48m
We can navigate to these services from any web browser. Here, one is the Dask diagnostic
dashboard, and the other is the Jupyter server. You can log into the Jupyter
notebook server with the password, dask
.
You can create a notebook and create a Dask client from there. The
DASK_SCHEDULER_ADDRESS
environment variable has been populated with the
address of the Dask scheduler. This is available in Python in the config
dictionary.
>>> from dask.distributed import Client, config
>>> config['scheduleraddress']
'baldeelscheduler:8786'
Although you don’t need to use this address, the Dask client will find this variable automatically.
from dask.distributed import Client, config
client = Client()
Configure Environment¶
By default, the Helm deployment launches three workers using two cores each and a standard conda environment. We can customize this environment by creating a small yaml file that implements a subset of the values in the dask helm chart values.yaml file.
For example, we can increase the number of workers, and include extra conda and pip packages to install on the both the workers and Jupyter server (these two environments should be matched).
# config.yaml
worker:
replicas: 8
resources:
limits:
cpu: 2
memory: 7.5G
requests:
cpu: 2
memory: 7.5G
env:
 name: EXTRA_CONDA_PACKAGES
value: numba xarray c condaforge
 name: EXTRA_PIP_PACKAGES
value: s3fs daskml upgrade
# We want to keep the same packages on the worker and jupyter environments
jupyter:
enabled: true
env:
 name: EXTRA_CONDA_PACKAGES
value: numba xarray matplotlib c condaforge
 name: EXTRA_PIP_PACKAGES
value: s3fs daskml upgrade
This config file overrides the configuration for the number and size of workers and the conda and pip packages installed on the worker and Jupyter containers. In general, we will want to make sure that these two software environments match.
Update your deployment to use this configuration file. Note that you will not use helm install for this stage: that would create a new deployment on the same Kubernetes cluster. Instead, you will upgrade your existing deployment by using the current name:
helm upgrade baldeel dask/dask f config.yaml
This will update those containers that need to be updated. It may take a minute or so.
As a reminder, you can list the names of deployments you have using helm
list
Check status and logs¶
For standard issues, you should be able to see the worker status and logs using the
Dask dashboard (in particular, you can see the worker links from the info/
page).
However, if your workers aren’t starting, you can check the status of pods and
their logs with the following commands:
kubectl get pods
kubectl logs <PODNAME>
mrocklin@pangeo181919:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
baldeeljupyter3805078281n1qk2 1/1 Running 0 18m
baldeelscheduler3074430035cn1dt 1/1 Running 0 58m
baldeelworker19318819141q09p 1/1 Running 0 18m
baldeelworker1931881914856mm 1/1 Running 0 18m
baldeelworker19318819149lgzb 1/1 Running 0 18m
baldeelworker1931881914bdn2c 1/1 Running 0 16m
baldeelworker1931881914jq70m 1/1 Running 0 17m
baldeelworker1931881914qsgj7 1/1 Running 0 18m
baldeelworker1931881914s2phd 1/1 Running 0 17m
baldeelworker1931881914srmmg 1/1 Running 0 17m
mrocklin@pangeo181919:~$ kubectl logs baldeelworker1931881914856mm
EXTRA_CONDA_PACKAGES environment variable found. Installing.
Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /opt/conda/envs/dask:
The following NEW packages will be INSTALLED:
fasteners: 0.14.1py36_2 condaforge
monotonic: 1.3py36_0 condaforge
zarr: 2.1.4py36_0 condaforge
Proceed ([y]/n)?
monotonic1.3 100% ############################### Time: 0:00:00 11.16 MB/s
fasteners0.14 100% ############################### Time: 0:00:00 576.56 kB/s
...
Delete a Helm deployment¶
You can always delete a helm deployment using its name:
helm delete baldeel purge
Note that this does not destroy any clusters that you may have allocated on a Cloud service (you will need to delete those explicitly).
Avoid the Jupyter Server¶
Sometimes you do not need to run a Jupyter server alongside your Dask cluster.
jupyter:
enabled: false
Kubernetes Native¶
See external documentation on DaskKubernetes for more information.
Kubernetes is a popular system for deploying distributed applications on clusters, particularly in the cloud. You can use Kubernetes to launch Dask workers in the following two ways:
Helm: You can launch a Dask scheduler, several workers, and an optional Jupyter Notebook server on a Kubernetes easily using Helm
helm repo add dask https://helm.dask.org/ # add the Dask Helm chart repository helm repo update # get latest Helm charts helm install dask/dask # deploy standard Dask chart
This is a good choice if you want to do the following:
 Run a managed Dask cluster for a long period of time
 Also deploy a Jupyter server from which to run code
 Share the same Dask cluster between many automated services
 Try out Dask for the first time on a cloudbased system like Amazon, Google, or Microsoft Azure (see also our Cloud documentation)
Note
For more information, see Dask and Helm documentation.
Native: You can quickly deploy Dask workers on Kubernetes from within a Python script or interactive session using DaskKubernetes
from dask_kubernetes import KubeCluster cluster = KubeCluster.from_yaml('workertemplate.yaml') cluster.scale(20) # add 20 workers cluster.adapt() # or create and destroy workers dynamically based on workload from dask.distributed import Client client = Client(cluster)
This is a good choice if you want to do the following:
 Dynamically create a personal and ephemeral deployment for interactive use
 Allow many individuals the ability to launch their own custom dask deployments, rather than depend on a centralized system
 Quickly adapt Dask cluster size to the current workload
Note
For more information, see DaskKubernetes documentation.
You may also want to see the documentation on using Dask with Docker containers to help you manage your software environments on Kubernetes.
Python API (advanced)¶
In some rare cases, experts may want to create Scheduler
, Worker
, and
Nanny
objects explicitly in Python. This is often necessary when making
tools to automatically deploy Dask in custom settings.
It is more common to create a Local cluster with Client() on a single machine or use the Command Line Interface (CLI). New readers are recommended to start there.
If you do want to start Scheduler and Worker objects yourself you should be a
little familiar with async
/await
style Python syntax. These objects
are awaitable and are commonly used within async with
context managers.
Here are a few examples to show a few ways to start and finish things.
Full Example¶
Scheduler ([loop, delete_interval, …]) 
Dynamic distributed task scheduler 
Worker ([scheduler_ip, scheduler_port, …]) 
Worker node in a Dask distributed cluster 
Client ([address, loop, timeout, …]) 
Connect to and submit computation to a Dask cluster 
We first start with a comprehensive example of setting up a Scheduler, two Workers, and one Client in the same event loop, running a simple computation, and then cleaning everything up.
import asyncio
from dask.distributed import Scheduler, Worker, Client
async def f():
async with Scheduler() as s:
async with Worker(s.address) as w1, Worker(s.address) as w2:
async with Client(s.address, asynchronous=True) as client:
future = client.submit(lambda x: x + 1, 10)
result = await future
print(result)
asyncio.get_event_loop().run_until_complete(f())
Now we look at simpler examples that build up to this case.
Scheduler¶
Scheduler ([loop, delete_interval, …]) 
Dynamic distributed task scheduler 
We create scheduler by creating a Scheduler()
object, and then await
that object to wait for it to start up. We can then wait on the .finished
method to wait until it closes. In the meantime the scheduler will be active
managing the cluster..
import asyncio
from dask.distributed import Scheduler, Worker
async def f():
s = Scheduler() # scheduler created, but not yet running
s = await s # the scheduler is running
await s.finished() # wait until the scheduler closes
asyncio.get_event_loop().run_until_complete(f())
This program will run forever, or until some external process connects to the
scheduler and tells it to stop. If you want to close things yourself you can
close any Scheduler
, Worker
, Nanny
, or Client
class by awaiting
the .close
method:
await s.close()
Worker¶
Worker ([scheduler_ip, scheduler_port, …]) 
Worker node in a Dask distributed cluster 
The worker follows the same API. The only difference is that the worker needs to know the address of the scheduler.
import asyncio
from dask.distributed import Scheduler, Worker
async def f(scheduler_address):
w = await Worker(scheduler_address)
await w.finished()
asyncio.get_event_loop().run_until_complete(f("tcp://127.0.0.1:8786"))
Start many in one event loop¶
Scheduler ([loop, delete_interval, …]) 
Dynamic distributed task scheduler 
Worker ([scheduler_ip, scheduler_port, …]) 
Worker node in a Dask distributed cluster 
We can run as many of these objects as we like in the same event loop.
import asyncio
from dask.distributed import Scheduler, Worker
async def f():
s = await Scheduler()
w = await Worker(s.address)
await w.finished()
await s.finished()
asyncio.get_event_loop().run_until_complete(f())
Use Context Managers¶
We can also use async with
context managers to make sure that we clean up
properly. Here is the same example as from above:
import asyncio
from dask.distributed import Scheduler, Worker
async def f():
async with Scheduler() as s:
async with Worker(s.address) as w:
await w.finished()
await s.finished()
asyncio.get_event_loop().run_until_complete(f())
Alternatively, in the example below we also include a Client
, run a small
computation, and then allow things to clean up after that computation..
import asyncio
from dask.distributed import Scheduler, Worker, Client
async def f():
async with Scheduler() as s:
async with Worker(s.address) as w1, Worker(s.address) as w2:
async with Client(s.address, asynchronous=True) as client:
future = client.submit(lambda x: x + 1, 10)
result = await future
print(result)
asyncio.get_event_loop().run_until_complete(f())
This is equivalent to creating and awaiting
each server, and then calling
.close
on each as we leave the context.
In this example we don’t wait on s.finished()
, so this will terminate
relatively quickly. You could have called await s.finished()
though if you
wanted this to run forever.
Nanny¶
Nanny ([scheduler_ip, scheduler_port, …]) 
A process to manage worker processes 
Alternatively, we can replace Worker
with Nanny
if we want your workers
to be managed in a separate process. The Nanny
constructor follows the
same API. This allows workers to restart themselves in case of failure. Also,
it provides some additional monitoring, and is useful when coordinating many
workers that should live in different processes in order to avoid the GIL.
# w = await Worker(s.address)
w = await Nanny(s.address)
API¶
These classes have a variety of keyword arguments that you can use to control their behavior. See the API documentation below for more information.
Scheduler¶

class
distributed.
Scheduler
(loop=None, delete_interval='500ms', synchronize_worker_interval='60s', services=None, service_kwargs=None, allowed_failures=None, extensions=None, validate=False, scheduler_file=None, security=None, worker_ttl=None, idle_timeout=None, interface=None, host=None, port=0, protocol=None, dashboard_address=None, preload=None, preload_argv=(), **kwargs)¶ Dynamic distributed task scheduler
The scheduler tracks the current state of workers, data, and computations. The scheduler listens for events and responds by controlling workers appropriately. It continuously tries to use the workers to execute an ever growing dask graph.
All events are handled quickly, in linear time with respect to their input (which is often of constant size) and generally within a millisecond. To accomplish this the scheduler tracks a lot of state. Every operation maintains the consistency of this state.
The scheduler communicates with the outside world through Comm objects. It maintains a consistent and valid view of the world even when listening to several clients at once.
A Scheduler is typically started either with the
daskscheduler
executable:$ daskscheduler Scheduler started at 127.0.0.1:8786
Or within a LocalCluster a Client starts up without connection information:
>>> c = Client() >>> c.cluster.scheduler Scheduler(...)
Users typically do not interact with the scheduler directly but rather with the client object
Client
.State
The scheduler contains the following state variables. Each variable is listed along with what it stores and a brief description.
 tasks:
{task key: TaskState}
 Tasks currently known to the scheduler
 tasks:
 unrunnable:
{TaskState}
 Tasks in the “noworker” state
 unrunnable:
 workers:
{worker key: WorkerState}
 Workers currently connected to the scheduler
 workers:
 idle:
{WorkerState}
:  Set of workers that are not fully utilized
 idle:
 saturated:
{WorkerState}
:  Set of workers that are not overutilized
 saturated:
 host_info:
{hostname: dict}
:  Information about each worker host
 host_info:
 clients:
{client key: ClientState}
 Clients currently connected to the scheduler
 clients:
 services:
{str: port}
:  Other services running on this scheduler, like Bokeh
 services:
 loop:
IOLoop
:  The running Tornado IOLoop
 loop:
 client_comms:
{client key: Comm}
 For each client, a Comm object used to receive task requests and report task status updates.
 client_comms:
 stream_comms:
{worker key: Comm}
 For each worker, a Comm object from which we both accept stimuli and report results
 stream_comms:
 task_duration:
{keyprefix: time}
 Time we expect certain functions to take, e.g.
{'sum': 0.25}
 task_duration:

adaptive_target
(comm=None, target_duration='5s')¶ Desired number of workers based on the current workload
This looks at the current running tasks and memory use, and returns a number of desired workers. This is often used by adaptive scheduling.
Parameters:  target_duration: str
A desired duration of time for computations to take. This affects how rapidly the scheduler will ask to scale.
See also

add_client
(comm, client=None)¶ Add client to network
We listen to all future messages from this Comm.

add_keys
(comm=None, worker=None, keys=())¶ Learn that a worker has certain keys
This should not be used in practice and is mostly here for legacy reasons. However, it is sent by workers from time to time.

add_plugin
(plugin=None, idempotent=False, **kwargs)¶ Add external plugin to scheduler
See https://distributed.readthedocs.io/en/latest/plugins.html

add_worker
(comm=None, address=None, keys=(), nthreads=None, name=None, resolve_address=True, nbytes=None, types=None, now=None, resources=None, host_info=None, memory_limit=None, metrics=None, pid=0, services=None, local_directory=None, nanny=None, extra=None)¶ Add a new worker to the cluster

broadcast
(comm=None, msg=None, workers=None, hosts=None, nanny=False, serializers=None)¶ Broadcast message to workers, return all results

cancel_key
(key, client, retries=5, force=False)¶ Cancel a particular key and all dependents

check_idle_saturated
(ws, occ=None)¶ Update the status of the idle and saturated state
The scheduler keeps track of workers that are ..
 Saturated: have enough work to stay busy
 Idle: do not have enough work to stay busy
They are considered saturated if they both have enough tasks to occupy all of their threads, and if the expected runtime of those tasks is large enough.
This is useful for load balancing and adaptivity.

client_heartbeat
(client=None)¶ Handle heartbeats from Client

client_releases_keys
(keys=None, client=None)¶ Remove keys from client desired list

close
(comm=None, fast=False, close_workers=False)¶ Send cleanup signal to all coroutines then wait until finished
See also
Scheduler.cleanup

close_worker
(stream=None, worker=None, safe=None)¶ Remove a worker from the cluster
This both removes the worker from our local state and also sends a signal to the worker to shut down. This works regardless of whether or not the worker has a nanny process restarting it

coerce_address
(addr, resolve=True)¶ Coerce possible input addresses to canonical form. resolve can be disabled for testing with fake hostnames.
Handles strings, tuples, or aliases.

coerce_hostname
(host)¶ Coerce the hostname of a worker.

decide_worker
(ts)¶ Decide on a worker for task ts. Return a WorkerState.

feed
(comm, function=None, setup=None, teardown=None, interval='1s', **kwargs)¶ Provides a data Comm to external requester
Caution: this runs arbitrary Python code on the scheduler. This should eventually be phased out. It is mostly used by diagnostics.

gather
(comm=None, keys=None, serializers=None)¶ Collect data in from workers

get_comm_cost
(ts, ws)¶ Get the estimated communication cost (in s.) to compute the task on the given worker.

get_task_duration
(ts, default=0.5)¶ Get the estimated computation cost of the given task (not including any communication cost).

get_worker_service_addr
(worker, service_name, protocol=False)¶ Get the (host, port) address of the named service on the worker. Returns None if the service doesn’t exist.
Parameters:  worker : address
 service_name : str
Common services include ‘bokeh’ and ‘nanny’
 protocol : boolean
Whether or not to include a full address with protocol (True) or just a (host, port) pair

handle_long_running
(key=None, worker=None, compute_duration=None)¶ A task has seceded from the thread pool
We stop the task from being stolen in the future, and change task duration accounting as if the task has stopped.

handle_worker
(comm=None, worker=None)¶ Listen to responses from a single worker
This is the main loop for schedulerworker interaction
See also
Scheduler.handle_client
 Equivalent coroutine for clients

identity
(comm=None)¶ Basic information about ourselves and our cluster

proxy
(comm=None, msg=None, worker=None, serializers=None)¶ Proxy a communication through the scheduler to some other worker

rebalance
(comm=None, keys=None, workers=None)¶ Rebalance keys so that each worker stores roughly equal bytes
Policy
This orders the workers by what fraction of bytes of the existing keys they have. It walks down this list from mosttoleast. At each worker it sends the largest results it can find and sends them to the least occupied worker until either the sender or the recipient are at the average expected load.

reevaluate_occupancy
(worker_index=0)¶ Periodically reassess task duration time
The expected duration of a task can change over time. Unfortunately we don’t have a good constanttime way to propagate the effects of these changes out to the summaries that they affect, like the total expected runtime of each of the workers, or what tasks are stealable.
In this coroutine we walk through all of the workers and realign their estimates with the current state of tasks. We do this periodically rather than at every transition, and we only do it if the scheduler process isn’t under load (using psutil.Process.cpu_percent()). This lets us avoid this fringe optimization when we have better things to think about.

register_worker_plugin
(comm, plugin, name=None)¶ Registers a setup function, and call it on every worker

remove_client
(client=None)¶ Remove client from network

remove_plugin
(plugin)¶ Remove external plugin from scheduler

remove_worker
(comm=None, address=None, safe=False, close=True)¶ Remove worker from cluster
We do this when a worker reports that it plans to leave or when it appears to be unresponsive. This may send its tasks back to a released state.

replicate
(comm=None, keys=None, n=None, workers=None, branching_factor=2, delete=True)¶ Replicate data throughout cluster
This performs a tree copy of the data throughout the network individually on each piece of data.
Parameters:  keys: Iterable
list of keys to replicate
 n: int
Number of replications we expect to see within the cluster
 branching_factor: int, optional
The number of workers that can copy data in each generation. The larger the branching factor, the more data we copy in a single step, but the more a given worker risks being swamped by data requests.
See also

report
(msg, ts=None, client=None)¶ Publish updates to all listening Queues and Comms
If the message contains a key then we only send the message to those comms that care about the key.

reschedule
(key=None, worker=None)¶ Reschedule a task
Things may have shifted and this task may now be better suited to run elsewhere

restart
(client=None, timeout=3)¶ Restart all workers. Reset local state.

retire_workers
(comm=None, workers=None, remove=True, close_workers=False, names=None, **kwargs)¶ Gracefully retire workers from cluster
Parameters:  workers: list (optional)
List of worker addresses to retire. If not provided we call
workers_to_close
which finds a good set workers_names: list (optional)
List of worker names to retire.
 remove: bool (defaults to True)
Whether or not to remove the worker metadata immediately or else wait for the worker to contact us
 close_workers: bool (defaults to False)
Whether or not to actually close the worker explicitly from here. Otherwise we expect some external job scheduler to finish off the worker.
 **kwargs: dict
Extra options to pass to workers_to_close to determine which workers we should drop
Returns:  Dictionary mapping worker ID/address to dictionary of information about
 that worker for each retired worker.
See also

run_function
(stream, function, args=(), kwargs={}, wait=True)¶ Run a function within this process
See also

scatter
(comm=None, data=None, workers=None, client=None, broadcast=False, timeout=2)¶ Send data out to workers
See also

send_task_to_worker
(worker, key)¶ Send a single computational task to a worker

start
()¶ Clear out old state and restart all running coroutines

start_ipython
(comm=None)¶ Start an IPython kernel
Returns Jupyter connection info dictionary.

stimulus_cancel
(comm, keys=None, client=None, force=False)¶ Stop execution on a list of keys

stimulus_missing_data
(cause=None, key=None, worker=None, ensure=True, **kwargs)¶ Mark that certain keys have gone missing. Recover.

stimulus_task_erred
(key=None, worker=None, exception=None, traceback=None, **kwargs)¶ Mark that a task has erred on a particular worker

stimulus_task_finished
(key=None, worker=None, **kwargs)¶ Mark that a task has finished execution on a particular worker

story
(*keys)¶ Get all transitions that touch one of the input keys

transition
(key, finish, *args, **kwargs)¶ Transition a key from its current state to the finish state
Returns:  Dictionary of recommendations for future transitions
See also
Scheduler.transitions
 transitive version of this function
Examples
>>> self.transition('x', 'waiting') {'x': 'processing'}

transition_story
(*keys)¶ Get all transitions that touch one of the input keys

transitions
(recommendations)¶ Process transitions until none are left
This includes feedback from previous transitions and continues until we reach a steady state

update_data
(comm=None, who_has=None, nbytes=None, client=None, serializers=None)¶ Learn that new data has entered the network from an external source
See also
Scheduler.mark_key_in_memory

update_graph
(client=None, tasks=None, keys=None, dependencies=None, restrictions=None, priority=None, loose_restrictions=None, resources=None, submitting_task=None, retries=None, user_priority=0, actors=None, fifo_timeout=0)¶ Add new computations to the internal dask graph
This happens whenever the Client calls submit, map, get, or compute.

valid_workers
(ts)¶ Return set of currently valid workers for key
If all workers are valid then this returns
True
. This checks tracks the following state: worker_restrictions
 host_restrictions
 resource_restrictions

worker_objective
(ts, ws)¶ Objective function to determine which worker should get the task
Minimize expected start time. If a tie then break with data storage.

worker_send
(worker, msg)¶ Send message to worker
This also handles connection failures by adding a callback to remove the worker on the next cycle.

workers_list
(workers)¶ List of qualifying workers
Takes a list of worker addresses or hostnames. Returns a list of all worker addresses that match

workers_to_close
(comm=None, memory_ratio=None, n=None, key=None, minimum=None, target=None, attribute='address')¶ Find workers that we can close with low cost
This returns a list of workers that are good candidates to retire. These workers are not running anything and are storing relatively little data relative to their peers. If all workers are idle then we still maintain enough workers to have enough RAM to store our data, with a comfortable buffer.
This is for use with systems like
distributed.deploy.adaptive
.Parameters:  memory_factor: Number
Amount of extra space we want to have for our stored data. Defaults two 2, or that we want to have twice as much memory as we currently have data.
 n: int
Number of workers to close
 minimum: int
Minimum number of workers to keep around
 key: Callable(WorkerState)
An optional callable mapping a WorkerState object to a group affiliation. Groups will be closed together. This is useful when closing workers must be done collectively, such as by hostname.
 target: int
Target number of workers to have after we close
 attribute : str
The attribute of the WorkerState object to return, like “address” or “name”. Defaults to “address”.
Returns:  to_close: list of worker addresses that are OK to close
See also
Examples
>>> scheduler.workers_to_close() ['tcp://192.168.0.1:1234', 'tcp://192.168.0.2:1234']
Group workers by hostname prior to closing
>>> scheduler.workers_to_close(key=lambda ws: ws.host) ['tcp://192.168.0.1:1234', 'tcp://192.168.0.1:4567']
Remove two workers
>>> scheduler.workers_to_close(n=2)
Keep enough workers to have twice as much memory as we we need.
>>> scheduler.workers_to_close(memory_ratio=2)
Worker¶

class
distributed.
Worker
(scheduler_ip=None, scheduler_port=None, scheduler_file=None, ncores=None, nthreads=None, loop=None, local_dir=None, local_directory=None, services=None, service_ports=None, service_kwargs=None, name=None, reconnect=True, memory_limit='auto', executor=None, resources=None, silence_logs=None, death_timeout=None, preload=None, preload_argv=None, security=None, contact_address=None, memory_monitor_interval='200ms', extensions=None, metrics={}, startup_information={}, data=None, interface=None, host=None, port=None, protocol=None, dashboard_address=None, nanny=None, plugins=(), low_level_profiler=False, validate=False, profile_cycle_interval=None, lifetime=None, lifetime_stagger=None, lifetime_restart=None, **kwargs)¶ Worker node in a Dask distributed cluster
Workers perform two functions:
 Serve data from a local dictionary
 Perform computation on that data and on data from peers
Workers keep the scheduler informed of their data and use that scheduler to gather data from other workers when necessary to perform a computation.
You can start a worker with the
daskworker
command line application:$ daskworker schedulerip:port
Use the
help
flag to see more options:$ daskworker help
The rest of this docstring is about the internal state the the worker uses to manage and track internal computations.
State
Informational State
These attributes don’t change significantly during execution.
 nthreads:
int
:  Number of nthreads used by this worker process
 nthreads:
 executor:
concurrent.futures.ThreadPoolExecutor
:  Executor used to perform computation
 executor:
 local_directory:
path
:  Path on local machine to store temporary files
 local_directory:
 scheduler:
rpc
:  Location of scheduler. See
.ip/.port
attributes.
 scheduler:
 name:
string
:  Alias
 name:
 services:
{str: Server}
:  Auxiliary web servers running on this worker
 services:
 service_ports:
{str: port}
:  total_out_connections:
int
 The maximum number of concurrent outgoing requests for data
 total_out_connections:
 total_in_connections:
int
 The maximum number of concurrent incoming requests for data
 total_in_connections:
 total_comm_nbytes:
int
 batched_stream:
BatchedSend
 A batched stream along which we communicate to the scheduler
 batched_stream:
 log:
[(message)]
 A structured and queryable log. See
Worker.story
 log:
Volatile State
This attributes track the progress of tasks that this worker is trying to complete. In the descriptions below a
key
is the name of a task that we want to compute anddep
is the name of a piece of dependent data that we want to collect from others. data:
{key: object}
:  Prefer using the host attribute instead of this, unless memory_limit and at least one of memory_target_fraction or memory_spill_fraction values are defined, in that case, this attribute is a zict.Buffer, from which information on LRU cache can be queried.
 data:
 data.memory:
{key: object}
:  Dictionary mapping keys to actual values stored in memory. Only available if condition for data being a zict.Buffer is met.
 data.memory:
 data.disk:
{key: object}
:  Dictionary mapping keys to actual values stored on disk. Only available if condition for data being a zict.Buffer is met.
 data.disk:
 task_state:
{key: string}
:  The state of all tasks that the scheduler has asked us to compute. Valid states include waiting, constrained, executing, memory, erred
 task_state:
 tasks:
{key: dict}
 The function, args, kwargs of a task. We run this when appropriate
 tasks:
 dependencies:
{key: {deps}}
 The data needed by this key to run
 dependencies:
 dependents:
{dep: {keys}}
 The keys that use this dependency
 dependents:
 data_needed: deque(keys)
 The keys whose data we still lack, arranged in a deque
 waiting_for_data:
{kep: {deps}}
 A dynamic verion of dependencies. All dependencies that we still don’t have for a particular key.
 waiting_for_data:
 ready: [keys]
 Keys that are ready to run. Stored in a LIFO stack
 constrained: [keys]
 Keys for which we have the data to run, but are waiting on abstract resources like GPUs. Stored in a FIFO deque
 executing: {keys}
 Keys that are currently executing
 executed_count: int
 A number of tasks that this worker has run in its lifetime
 long_running: {keys}
 A set of keys of tasks that are running and have started their own longrunning clients.
 dep_state:
{dep: string}
:  The state of all dependencies required by our tasks Valid states include waiting, flight, and memory
 dep_state:
 who_has:
{dep: {worker}}
 Workers that we believe have this data
 who_has:
 has_what:
{worker: {deps}}
 The data that we care about that we think a worker has
 has_what:
 pending_data_per_worker:
{worker: [dep]}
 The data on each worker that we still want, prioritized as a deque
 pending_data_per_worker:
 in_flight_tasks:
{task: worker}
 All dependencies that are coming to us in current peertopeer connections and the workers from which they are coming.
 in_flight_tasks:
 in_flight_workers:
{worker: {task}}
 The workers from which we are currently gathering data and the dependencies we expect from those connections
 in_flight_workers:
 comm_bytes:
int
 The total number of bytes in flight
 comm_bytes:
 suspicious_deps:
{dep: int}
 The number of times a dependency has not been where we expected it
 suspicious_deps:
 nbytes:
{key: int}
 The size of a particular piece of data
 nbytes:
 types:
{key: type}
 The type of a particular piece of data
 types:
 threads:
{key: int}
 The ID of the thread on which the task ran
 threads:
 active_threads:
{int: key}
 The keys currently running on active threads
 active_threads:
 exceptions:
{key: exception}
 The exception caused by running a task if it erred
 exceptions:
 tracebacks:
{key: traceback}
 The exception caused by running a task if it erred
 tracebacks:
 startstops:
{key: [(str, float, float)]}
 Log of transfer, load, and compute times for a task
 startstops:
 priorities:
{key: tuple}
 The priority of a key given by the scheduler. Determines run order.
 priorities:
 durations:
{key: float}
 Expected duration of a task
 durations:
 resource_restrictions:
{key: {str: number}}
 Abstract resources required to run a task
 resource_restrictions:
Parameters:  scheduler_ip: str
 scheduler_port: int
 ip: str, optional
 data: MutableMapping, type, None
The object to use for storage, builds a diskbacked LRU dict by default
 nthreads: int, optional
 loop: tornado.ioloop.IOLoop
 local_directory: str, optional
Directory where we place local resources
 name: str, optional
 memory_limit: int, float, string
Number of bytes of memory that this worker should use. Set to zero for no limit. Set to ‘auto’ to calculate as system.MEMORY_LIMIT * min(1, nthreads / total_cores) Use strings or numbers like 5GB or 5e9
 memory_target_fraction: float
Fraction of memory to try to stay beneath
 memory_spill_fraction: float
Fraction of memory at which we start spilling to disk
 memory_pause_fraction: float
Fraction of memory at which we stop running new tasks
 executor: concurrent.futures.Executor
 resources: dict
Resources that this worker has like
{'GPU': 2}
 nanny: str
Address on which to contact nanny, if it exists
 lifetime: str
Amount of time like “1 hour” after which we gracefully shut down the worker. This defaults to None, meaning no explicit shutdown time.
 lifetime_stagger: str
Amount of time like “5 minutes” to stagger the lifetime value The actual lifetime will be selected uniformly at random between lifetime +/ lifetime_stagger
 lifetime_restart: bool
Whether or not to restart a worker after it has reached its lifetime Default False
See also
distributed.scheduler.Scheduler
,distributed.nanny.Nanny
Examples
Use the command line to start a worker:
$ daskscheduler Start scheduler at 127.0.0.1:8786 $ daskworker 127.0.0.1:8786 Start worker at: 127.0.0.1:1234 Registered with scheduler at: 127.0.0.1:8786

close_gracefully
()¶ Gracefully shut down a worker
This first informs the scheduler that we’re shutting down, and asks it to move our data elsewhere. Afterwards, we close as normal

executor_submit
(key, function, args=(), kwargs=None, executor=None)¶ Safely run function in thread pool executor
We’ve run into issues running concurrent.future futures within tornado. Apparently it’s advantageous to use timeouts and periodic callbacks to ensure things run smoothly. This can get tricky, so we pull it off into an separate method.

get_current_task
()¶ Get the key of the task we are currently running
This only makes sense to run within a task
See also
get_worker
Examples
>>> from dask.distributed import get_worker >>> def f(): ... return get_worker().get_current_task()
>>> future = client.submit(f) # doctest: +SKIP >>> future.result() # doctest: +SKIP 'f1234'

local_dir
¶ For API compatibility with Nanny

memory_monitor
()¶ Track this process’s memory usage and act accordingly
If we rise above 70% memory use, start dumping data to disk.
If we rise above 80% memory use, stop execution of new tasks

start_ipython
(comm)¶ Start an IPython kernel
Returns Jupyter connection info dictionary.

trigger_profile
()¶ Get a frame from all actively computing threads
Merge these frames into existing profile counts

worker_address
¶ For API compatibility with Nanny
Nanny¶

class
distributed.
Nanny
(scheduler_ip=None, scheduler_port=None, scheduler_file=None, worker_port=0, nthreads=None, ncores=None, loop=None, local_dir=None, local_directory='daskworkerspace', services=None, name=None, memory_limit='auto', reconnect=True, validate=False, quiet=False, resources=None, silence_logs=None, death_timeout=None, preload=None, preload_argv=None, security=None, contact_address=None, listen_address=None, worker_class=None, env=None, interface=None, host=None, port=None, protocol=None, config=None, **worker_kwargs)¶ A process to manage worker processes
The nanny spins up Worker processes, watches then, and kills or restarts them as necessary. It is necessary if you want to use the
Client.restart
method, or to restart the worker automatically if it gets to the terminate fractiom of its memory limit.The parameters for the Nanny are mostly the same as those for the Worker.
See also

close
(comm=None, timeout=5, report=None)¶ Close the worker process, stop all comms.

close_gracefully
(comm=None)¶ A signal that we shouldn’t try to restart workers if they go away
This is used as part of the cluster shutdown process.

instantiate
(comm=None)¶ Start a local worker process
Blocks until the process is up and the scheduler is properly informed

kill
(comm=None, timeout=2)¶ Kill the local worker process
Blocks until both the process is down and the scheduler is properly informed

local_dir
¶ For API compatibility with Nanny

memory_monitor
()¶ Track worker’s memory. Restart if it goes above terminate fraction

start
()¶ Start nanny, start local process, start watching

Cloud Deployments¶
There are a variety of ways to deploy Dask on cloud providers. Cloud providers provide managed services, like Kubernetes, Yarn, or custom APIs with which Dask can connect easily. You may want to consider the following options:
A managed Kubernetes service and Dask’s Kubernetes and Helm integraation.
A managed Yarn service, like Amazon EMR or Google Cloud DataProc and DaskYarn.
Specific documentation for the popular Amazon EMR service can be found here
Vendor specific services, like Amazon ECS, and Dask Cloud Provider
Data Access¶
You may want to install additional libraries in your Jupyter and worker images to access the object stores of each cloud:
Historical Libraries¶
Dask previously maintained libraries for deploying Dask on Amazon’s EC2 and Google GKE. Due to sporadic interest, and churn both within the Dask library and EC2 itself, these were not well maintained. They have since been deprecated in favor of the Kubernetes and Helm solution.
Adaptive Deployments¶
Motivation¶
Most Dask deployments are static with a single scheduler and a fixed number of workers. This results in predictable behavior, but is wasteful of resources in two situations:
 The user may not be using the cluster, or perhaps they are busy interpreting a recent result or plot, and so the workers sit idly, taking up valuable shared resources from other potential users
 The user may be very active, and is limited by their original allocation.
Particularly efficient users may learn to manually add and remove workers during their session, but this is rare. Instead, we would like the size of a Dask cluster to match the computational needs at any given time. This is the goal of the adaptive deployments discussed in this document. These are particularly helpful for interactive workloads, which are characterized by long periods of inactivity interrupted with short bursts of heavy activity. Adaptive deployments can result in both faster analyses that give users much more power, but with much less pressure on computational resources.
Adaptive¶
To make setting up adaptive deployments easy, some Dask deployment solutions
offer an .adapt()
method. Here is an example with
dask_kubernetes.KubeCluster.
from dask_kubernetes import KubeCluster
cluster = KubeCluster()
cluster.adapt(minimum=0, maximum=100) # scale between 0 and 100 workers
For more keyword options, see the Adaptive class below:
Adaptive ([cluster, interval, minimum, …]) 
Adaptively allocate workers based on scheduler load. 
Dependence on a Resource Manager¶
The Dask scheduler does not know how to launch workers on its own. Instead, it relies on an external resource scheduler like Kubernetes above, or Yarn, SGE, SLURM, Mesos, or some other inhouse system (see setup documentation for options). In order to use adaptive deployments, you must provide some mechanism for the scheduler to launch new workers. Typically, this is done by using one of the solutions listed in the setup documentation, or by subclassing from the Cluster superclass and implementing that API.
Cluster (asynchronous) 
Superclass for cluster objects 
Scaling Heuristics¶
The Dask scheduler tracks a variety of information that is useful to correctly allocate the number of workers:
 The historical runtime of every function and task that it has seen, and all of the functions that it is currently able to run for users
 The amount of memory used and available on each worker
 Which workers are idle or saturated for various reasons, like the presence of specialized hardware
From these, it is able to determine a target number of workers by dividing the
cumulative expected runtime of all pending tasks by the target_duration
parameter (defaults to five seconds). This number of workers serves as a
baseline request for the resource manager. This number can be altered for a
variety of reasons:
 If the cluster needs more memory, then it will choose either the target number of workers or twice the current number of workers (whichever is larger)
 If the target is outside of the range of the minimum and maximum values, then it is clipped to fit within that range
Additionally, when scaling down, Dask preferentially chooses those workers that
are idle and have the least data in memory. It moves that data to other
machines before retiring the worker. To avoid rapid cycling of the cluster up
and down in size, we only retire a worker after a few cycles have gone by where
it has consistently been a good idea to retire it (controlled by the
wait_count
and interval
parameters).
API¶

class
distributed.deploy.
Adaptive
(cluster=None, interval='1s', minimum=0, maximum=inf, wait_count=3, target_duration='5s', worker_key=None, **kwargs)¶ Adaptively allocate workers based on scheduler load. A superclass.
Contains logic to dynamically resize a Dask cluster based on current use. This class needs to be paired with a system that can create and destroy Dask workers using a cluster resource manager. Typically it is built into already existing solutions, rather than used directly by users. It is most commonly used from the
.adapt(...)
method of various Dask cluster classes.Parameters:  cluster: object
Must have scale and scale_down methods/coroutines
 interval : timedelta or str, default “1000 ms”
Milliseconds between checks
 wait_count: int, default 3
Number of consecutive times that a worker should be suggested for removal before we remove it.
 target_duration: timedelta or str, default “5s”
Amount of time we want a computation to take. This affects how aggressively we scale up.
 worker_key: Callable[WorkerState]
Function to group workers together when scaling down See Scheduler.workers_to_close for more information
 minimum: int
Minimum number of workers to keep around
 maximum: int
Maximum number of workers to keep around
 **kwargs:
Extra parameters to pass to Scheduler.workers_to_close
Notes
Subclasses can override
Adaptive.should_scale_up()
andAdaptive.workers_to_close()
to control when the cluster should be resized. The default implementation checks if there are too many tasks per worker or too little memory available (seeAdaptive.needs_cpu()
andAdaptive.needs_memory()
).Examples
This is commonly used from existing Dask classes, like KubeCluster
>>> from dask_kubernetes import KubeCluster >>> cluster = KubeCluster() >>> cluster.adapt(minimum=10, maximum=100)
Alternatively you can use it from your own Cluster class by subclassing from Dask’s Cluster superclass
>>> from distributed.deploy import Cluster >>> class MyCluster(Cluster): ... def scale_up(self, n): ... """ Bring worker count up to n """ ... def scale_down(self, workers): ... """ Remove worker addresses from cluster """
>>> cluster = MyCluster() >>> cluster.adapt(minimum=10, maximum=100)

class
distributed.deploy.
Cluster
(asynchronous)¶ Superclass for cluster objects
This class contains common functionality for Dask Cluster manager classes.
To implement this class, you must provide
 A
scheduler_comm
attribute, which is a connection to the scheduler following thedistributed.core.rpc
API.  Implement
scale
, which takes an integer and scales the cluster to that many workers, or else set_supports_scaling
to False
For that, should should get the following:
 A standard
__repr__
 A live IPython widget
 Adaptive scaling
 Integration with dasklabextension
 A
scheduler_info
attribute which contains an uptodate copy ofScheduler.identity()
, which is used for much of the above  Methods to gather logs
 A
Docker Images¶
Example docker images are maintained at https://github.com/dask/daskdocker and https://hub.docker.com/r/daskdev/ .
Each image installs the full Dask conda package (including the distributed scheduler), Numpy, and Pandas on top of a Miniconda installation on top of a Debian image.
These images are large, around 1GB.
daskdev/dask
: This a normal debian + miniconda image with the full Dask conda package (including the distributed scheduler), Numpy, and Pandas. This image is about 1GB in size.daskdev/dasknotebook
: This is based on the Jupyter basenotebook image and so it is suitable for use both normally as a Jupyter server, and also as part of a JupyterHub deployment. It also includes a matching Dask software environment described above. This image is about 2GB in size.
Example¶
Here is a simple example on the local host network
docker run it network host daskdev/dask daskscheduler # start scheduler
docker run it network host daskdev/dask daskworker localhost:8786 # start worker
docker run it network host daskdev/dask daskworker localhost:8786 # start worker
docker run it network host daskdev/dask daskworker localhost:8786 # start worker
docker run it network host daskdev/dasknotebook # start Jupyter server
Extensibility¶
Users can mildly customize the software environment by populating the
environment variables EXTRA_APT_PACKAGES
, EXTRA_CONDA_PACKAGES
, and
EXTRA_PIP_PACKAGES
. If these environment variables are set in the container,
they will trigger calls to the following respectively:
aptget install $EXTRA_APT_PACKAGES
conda install $EXTRA_CONDA_PACKAGES
pip install $EXTRA_PIP_PACKAGES
For example, the following conda
installs the joblib
package into
the Dask worker software environment:
docker run it e EXTRA_CONDA_PACKAGES="joblib" daskdev/dask daskworker localhost:8786
Note that using these can significantly delay the container from starting,
especially when using apt
, or conda
(pip
is relatively fast).
Remember that it is important for software versions to match between Dask workers and Dask clients. As a result, it is often useful to include the same extra packages in both Jupyter and Worker images.
Source¶
Docker files are maintained at https://github.com/dask/daskdocker. This repository also includes a dockercompose configuration.
Custom Initialization¶
Often we want to run custom code when we start up or tear down a scheduler or
worker. We might do this manually with functions like Client.run
or
Client.run_on_scheduler
, but this is error prone and difficult to automate.
To resolve this, Dask includes a few mechanisms to run arbitrary code around the lifecycle of a Scheduler or Worker.
Preload Scripts¶
Both daskscheduler
and daskworker
support a preload
option that
allows custom initialization of each scheduler/worker respectively. A module or
Python file passed as a preload
value is guaranteed to be imported before
establishing any connection. A dask_setup(service)
function is called if
found, with a Scheduler
or Worker
instance as the argument. As the
service stops, dask_teardown(service)
is called if present.
To support additional configuration, a single preload
module may register
additional commandline arguments by exposing dask_setup
as a Click
command. This command will be used to parse additional arguments provided to
daskworker
or daskscheduler
and will be called before service
initialization.
As an example, consider the following file that creates a scheduler plugin and registers it with the scheduler
# schedulersetup.py
import click
from distributed.diagnostics.plugin import SchedulerPlugin
class MyPlugin(SchedulerPlugin):
def __init__(self, print_count):
self.print_count = print_count
SchedulerPlugin.__init__(self)
def add_worker(self, scheduler=None, worker=None, **kwargs):
print("Added a new worker at:", worker)
if self.print_count and scheduler is not None:
print("Total workers:", len(scheduler.workers))
@click.command()
@click.option("printcount/noprintcount", default=False)
def dask_setup(scheduler, print_count):
plugin = MyPlugin(print_count)
scheduler.add_plugin(plugin)
We can then run this preload script by referring to its filename (or module name if it is on the path) when we start the scheduler:
daskscheduler preload schedulersetup.py printcount
Worker Lifecycle Plugins¶
You can also create a class with setup and teardown methods, and register that class with the scheduler to give to every worker.
Client.register_worker_plugin ([plugin, name]) 
Registers a lifecycle worker plugin for all current and future workers. 

Client.
register_worker_plugin
(plugin=None, name=None)¶ Registers a lifecycle worker plugin for all current and future workers.
This registers a new object to handle setup, task state transitions and teardown for workers in this cluster. The plugin will instantiate itself on all currently connected workers. It will also be run on any worker that connects in the future.
The plugin may include methods
setup
,teardown
, andtransition
. See thedask.distributed.WorkerPlugin
class or the examples below for the interface and docstrings. It must be serializable with the pickle or cloudpickle modules.If the plugin has a
name
attribute, or if thename=
keyword is used then that will control idempotency. A a plugin with that name has already registered then any future plugins will not run.For alternatives to plugins, you may also wish to look into preload scripts.
Parameters:  plugin: WorkerPlugin
The plugin object to pass to the workers
 name: str, optional
A name for the plugin. Registering a plugin with the same name will have no effect.
See also
distributed.WorkerPlugin
Examples
>>> class MyPlugin(WorkerPlugin): ... def __init__(self, *args, **kwargs): ... pass # the constructor is up to you ... def setup(self, worker: dask.distributed.Worker): ... pass ... def teardown(self, worker: dask.distributed.Worker): ... pass ... def transition(self, key: str, start: str, finish: str, **kwargs): ... pass
>>> plugin = MyPlugin(1, 2, 3) >>> client.register_worker_plugin(plugin)
You can get access to the plugin with the
get_worker
function>>> client.register_worker_plugin(other_plugin, name='myplugin') >>> def f(): ... worker = get_worker() ... plugin = worker.plugins['myplugin'] ... return plugin.my_state
>>> future = client.run(f)
Community¶
Dask is used and developed by individuals at a variety of institutions. It sits within the broader Python numeric ecosystem commonly referred to as PyData or SciPy.
Discussion¶
Conversation happens in the following places:
Usage questions are directed to Stack Overflow with the #dask tag. Dask developers monitor this tag and get emails whenever a question is asked
Bug reports and feature requests are managed on the GitHub issue tracker
Chat occurs on at gitter.im/dask/dask for general conversation and gitter.im/dask/dev for developer conversation. Note that because gitter chat is not searchable by future users we discourage usage questions and bug reports on gitter and instead ask people to use Stack Overflow or GitHub.
Monthly developer meeting happens the first Thursday of the month at 10:00 US Central Time in this video meeting. Meeting notes are available at https://docs.google.com/document/d/1UqNAP87a56ERH_xkQsS5Q_0PKYybd5Lj2WANy_hRzI0/edit
You can subscribe to this calendar to be notified of changes:
Asking for help¶
We welcome usage questions and bug reports from all users, even those who are new to using the project. There are a few things you can do to improve the likelihood of quickly getting a good answer.
Ask questions in the right place: We strongly prefer the use of Stack Overflow or GitHub issues over Gitter chat. GitHub and Stack Overflow are more easily searchable by future users, and therefore is more efficient for everyone’s time. Gitter chat is strictly reserved for developer and community discussion.
If you have a general question about how something should work or want best practices then use Stack Overflow. If you think you have found a bug then use GitHub
Ask only in one place: Please restrict yourself to posting your question in only one place (likely Stack Overflow or GitHub) and don’t post in both
Create a minimal example: It is ideal to create minimal, complete, verifiable examples. This significantly reduces the time that answerers spend understanding your situation, resulting in higher quality answers more quickly.
See also this blogpost about crafting minimal bug reports. These have a much higher likelihood of being answered
Paid support¶
In addition to the previous options, paid support is available from
 Anaconda: https://www.anaconda.com/support
 Quansight: https://www.quansight.com/opensourcesupport
Why Dask?¶
This document gives highlevel motivation on why people choose to adopt Dask.
Python’s role in Data Science¶
Python has grown to become the dominant language both in data analytics and general programming:
This is fueled both by computational libraries like Numpy, Pandas, and ScikitLearn and by a wealth of libraries for visualization, interactive notebooks, collaboration, and so forth.
However, these packages were not designed to scale beyond a single machine. Dask was developed to scale these packages and the surrounding ecosystem. It works with the existing Python ecosystem to scale it to multicore machines and distributed clusters.
Dask has a Familiar API¶
Analysts often use tools like Pandas, ScikitLearn, Numpy, and the rest of the Python ecosystem to analyze data on their personal computer. They like these tools because they are efficient, intuitive, and widely trusted. However, when they choose to apply their analyses to larger datasets, they find that these tools were not designed to scale beyond a single machine. And so, the analyst rewrites their computation using a more scalable tool, often in another language altogether. This rewrite process slows down discovery and causes frustration.
Dask provides ways to scale Pandas, ScikitLearn, and Numpy workflows more natively, with minimal rewriting. It integrates well with these tools so that it copies most of their API and uses their data structures internally. Moreover, Dask is codeveloped with these libraries to ensure that they evolve consistently, minimizing friction when transitioning from a local laptop, to a multicore workstation, and then to a distributed cluster. Analysts familiar with Pandas/ScikitLearn/Numpy will be immediately familiar with their Dask equivalents, and have much of their intuition carry over to a scalable context.
Dask Scales out to Clusters¶
As datasets and computations scale faster than CPUs and RAM, we need to find ways to scale our computations across multiple machines. This introduces many new concerns:
 How to have computers talk to each other over the network?
 How and when to move data between machines?
 How to recover from machine failures?
 How to deploy on an inhouse cluster?
 How to deploy on the cloud?
 How to deploy on an HPC supercomputer?
 How to provide an API to this system that users find intuitive?
 …
While it is possible to build these systems inhouse (and indeed, many exist), many organizations increasingly depend on solutions developed within the open source community. These tend to be more robust, secure, and fully featured without being tended by inhouse staff.
Dask solves the problems above. It figures out how to break up large computations and route parts of them efficiently onto distributed hardware. Dask is routinely run on thousandmachine clusters to process hundreds of terabytes of data efficiently within secure environments.
Dask has utilities and documentation on how to deploy inhouse, on the cloud, or on HPC supercomputers. It supports encryption and authentication using TLS/SSL certificates. It is resilient and can handle the failure of worker nodes gracefully and is elastic, and so can take advantage of new nodes added onthefly. Dask includes several user APIs that are used and smoothed over by thousands of researchers across the globe working in different domains.
Dask Scales Down to Single Computers¶
But a massive cluster is not always the right choice
Today’s laptops and workstations are surprisingly powerful and, if used correctly, can handle datasets and computations for which we previously depended on clusters. A modern laptop has a multicore CPU, 32GB of RAM, and flashbased hard drives that can stream through data several times faster than HDDs or SSDs of even a year or two ago.
As a result, Dask can empower analysts to manipulate 100GB+ datasets on their laptop or 1TB+ datasets on a workstation without bothering with the cluster at all. This can be preferable for the following reasons:
 They can use their local software environment, rather than being constrained by what is available on the cluster or having to manage Docker images.
 They can more easily work while in transit, at a coffee shop, or at home away from the corporate network
 Debugging errors and analyzing performance is simpler and more pleasant on a single machine
 Their iteration cycles can be faster
 Their computations may be more efficient because all of the data is local and doesn’t need to flow through the network or between separate processes
Dask can enable efficient parallel computations on single machines by leveraging their multicore CPUs and streaming data efficiently from disk. It can run on a distributed cluster, but it doesn’t have to. Dask allows you to swap out the cluster for singlemachine schedulers which are surprisingly lightweight, require no setup, and can run entirely within the same process as the user’s session.
To avoid excess memory use, Dask is good at finding ways to evaluate computations in a lowmemory footprint when possible by pulling in chunks of data from disk, doing the necessary processing, and throwing away intermediate values as quickly as possible. This lets analysts perform computations on moderately large datasets (100GB+) even on relatively lowpower laptops. This requires no configuration and no setup, meaning that adding Dask to a singlemachine computation adds very little cognitive overhead.
Dask is installed by default with Anaconda and so is already deployed on most data science machines.
Dask Integrates Natively with Python Code¶
Python includes computational libraries like Numpy, Pandas, and ScikitLearn, and many others for data access, plotting, statistics, image and signal processing, and more. These libraries work together seamlessly to produce a cohesive ecosystem of packages that coevolve to meet the needs of analysts in most domains today.
This ecosystem is tied together by common standards and protocols to which everyone adheres, which allows these packages to benefit each other in surprising and delightful ways.
Dask evolved from within this ecosystem. It abides by these standards and protocols and actively engages in community efforts to push forward new ones. This enables the rest of the ecosystem to benefit from parallel and distributed computing with minimal coordination. Dask does not seek to disrupt or displace the existing ecosystem, but rather to complement and benefit it from within.
As a result, Dask development is pushed forward by developer communities from Pandas, Numpy, ScikitLearn, ScikitImage, Jupyter, and others. This engagement from the broader community growth helps users to trust the project and helps to ensure that the Python ecosystem will continue to evolve in a smooth and sustainable manner.
Dask Supports Complex Applications¶
Some parallel computations are simple and just apply the same routine onto many inputs without any kind of coordination. These are simple to parallelize with any system.
Somewhat more complex computations can be expressed with the mapshufflereduce pattern popularized by Hadoop and Spark. This is often sufficient to do most data cleaning tasks, databasestyle queries, and some lightweight machine learning algorithms.
However, more complex parallel computations exist which do not fit into these paradigms, and so are difficult to perform with traditional bigdata technologies. These include more advanced algorithms for statistics or machine learning, time series or local operations, or bespoke parallelism often found within the systems of large enterprises.
Many companies and institutions today have problems which are clearly parallelizable, but not clearly transformable into a big DataFrame computation. Today these companies tend to solve their problems either by writing custom code with lowlevel systems like MPI, ZeroMQ, or sockets and complex queuing systems, or by shoving their problem into a standard bigdata technology like MapReduce or Spark, and hoping for the best.
Dask helps to resolve these situations by exposing lowlevel APIs to its internal task scheduler which is capable of executing very advanced computations. This gives engineers within the institution the ability to build their own parallel computing system using the same engine that powers Dask’s arrays, DataFrames, and machine learning algorithms, but now with the institution’s own custom logic. This allows engineers to keep complex business logic inhouse while still relying on Dask to handle network communication, load balancing, resilience, diagnostics, etc..
Dask Delivers Responsive Feedback¶
Because everything happens remotely, interactive parallel computing can be frustrating for users. They don’t have a good sense of how computations are progressing, what might be going wrong, or what parts of their code should they focus on for performance. The added distance between a user and their computation can drastically affect how quickly they are able to identify and resolve bugs and performance problems, which can drastically increase their time to solution.
Dask keeps users informed and content with a suite of helpful diagnostic and investigative tools including the following:
 A realtime and responsive dashboard that shows current progress, communication costs, memory use, and more, updated every 100ms
 A statistical profiler installed on every worker that polls each thread every 10ms to determine which lines in your code are taking up the most time across your entire computation
 An embedded IPython kernel in every worker and the scheduler, allowing users to directly investigate the state of their computation with a popup terminal
 The ability to reraise errors locally, so that they can use the traditional debugging tools to which they are accustomed, even when the error happens remotely
Links and More Information¶
From here you may want to read about some of our more common introductory content:
Institutional FAQ¶
Question: Is appropriate for adoption within a larger institutional context?
Answer: Yes. Dask is used within the world’s largest banks, national labs, retailers, technology companies, and government agencies. It is used in highly secure environments. It is used in conservative institutions as well as fast moving ones.
This page contains Frequently Asked Questions and concerns from institutions when they first investigate Dask.
 For Management
 For IT
 For Technical Leads
 Will Dask “just work” on our existing code?
 How well does Dask scale? What are Dask’s limitations?
 Is Dask resilient? What happens when a machine goes down?
 Is the API exactly the same as NumPy/Pandas/ScikitLearn?
 How much performance tuning does Dask require?
 What Data formats does Dask support?
 Does Dask have a SQL interface?
For Management¶
Briefly, what problem does Dask solve for us?¶
Dask is a general purpose parallel programming solution. As such it is used in many different ways.
However, the most common problem that Dask solves is connecting Python analysts to distributed hardware, particularly for data science and machine learning workloads. The institutions for whom Dask has the greatest impact are those who have a large body of Python users who are accustomed to libraries like NumPy, Pandas, Jupyter, ScikitLearn and others, but want to scale those workloads across a cluster. Often they also have distributed computing resources that are going underused.
Dask removes both technological and cultural barriers to connect Python users to computing resources in a way that is native to both the users and IT.
“Help me scale my notebook onto the cluster” is a common pain point for institutions today, and it is a common entry point for Dask usage.
Is Dask mature? Why should we trust it?¶
Yes. While Dask itself is relatively new (it began in 2015) it is built by the NumPy, Pandas, Jupyter, ScikitLearn developer community, which is well trusted. Dask is a relatively thin wrapper on top of these libraries and, as a result, the project can be relatively small and simple. It doesn’t reinvent a whole new system.
Additionally, this tight integration with the broader technology stack gives substantial benefits long term. For example:
 Because Pandas maintainers also maintain Dask, when Pandas issues a new releases Dask issues a release at the same time to ensure continuity and compatibility.
 Because ScikitLearn maintainers maintain and use Dask when they train on large clusters, you can be assured that DaskML focuses on pragmatic and important solutions like XGBoost integration, and hyperparameter selection, and that the integration between the two feels natural for novice and expert users alike.
 Because Jupyter maintainers also maintain Dask, powerful Jupyter technologies like JupyterHub and JupyterLab are designed with Dask’s needs in mind, and new features are pushed quickly to provide a first class and modern user experience.
Additionally, Dask is maintained both by a broad community of maintainers, as well as substantial institutional support (several fulltime employees each) by both Anaconda, the company behind the leading data science distribution, and NVIDIA, the leading hardware manufacturer of GPUs. Despite large corporate support, Dask remains a community governed project, and is fiscally sponsored by NumFOCUS, the same 501c3 that fiscally sponsors NumPy, Pandas, Jupyter, and many others.
Who else uses Dask?¶
Dask is used by individual researchers in practically every field today. It has millions of downloads per month, and is integrated into many PyData software packages today.
On an institutional level Dask is used by analytics and research groups in a similarly broad set of domains across both energetic startups as well as large conservative household names. A web search shows articles by Capital One, Barclays, Walmart, NASA, Los Alamos National Laboratories, and hundreds of other similar institutions.
How does Dask compare with Apache Spark?¶
This question has longer and more technical coverage here
Dask and Apache Spark are similar in that they both …
 Promise easy parallelism for data science Python users
 Provide Dataframe and ML APIs for ETL, data science, and machine learning
 Scale out to similar scales, around 11000 machines
Dask differs from Apache Spark in a few ways:
Dask is more Python native, Spark is Scala/JVM native with Python bindings.
Python users may find Dask more comfortable, but Dask is only useful for Python users, while Spark can also be used from JVM languages.
Dask is one component in the broader Python ecosystem alongside libraries like Numpy, Pandas, and ScikitLearn, while Spark is an allinone system that reinvents much of the Python world in a single package.
This means that it’s often easier to compose Dask with new problem domains, but also that you need to install multiple things (like Dask and Pandas or Dask and Numpy) rather than just having everything in an allinone solution.
Apache Spark focuses strongly on traditional business intelligence workloads, like ETL, SQL queries, and then some lightweight machine learning, while Dask is more general purpose.
This means that Dask is much more flexible and can handle other problem domains like multidimensional arrays, GIS, advanced machine learning, and custom systems, but that it is less focused and less tuned on typical SQL style computations.
If you mostly want to focus on SQL queries then Spark is probably a better bet. If you want to support a wide variety of custom workloads then Dask might be more natural.
For IT¶
How would I set up Dask on institutional hardware?¶
You already have cluster resources. Dask can run on them today without significant change.
Most institutional clusters today have a resource manager. This is typically managed by IT, with some mild permissions given to users to launch jobs. Dask works with all major resource managers today, including those on Hadoop, HPC, Kubernetes, and Cloud clusters.
Hadoop/Spark: If you have a Hadoop/Spark cluster, such as one purchased through Cloudera/Hortonworks/MapR then you will likely want to deploy Dask with YARN, the resource manager that deploys services like Hadoop, Spark, Hive, and others.
To help with this, you’ll likely want to use DaskYarn.
HPC: If you have an HPC machine that runs resource managers like SGE, SLLURM, PBS, LSF, Torque, Condor, or other job batch queuing systems, then users can launch Dask on these systems today using either:
 Dask Jobqueue , which uses typical
qsub
,sbatch
,bsub
or other submission tools in interactive settings.  Dask MPI which uses MPI for deployment in batch settings
For more information see High Performance Computers
 Dask Jobqueue , which uses typical
Kubernetes/Cloud: Newer clusters may employ Kubernetes for deployment. This is particularly commonly used today on major cloud providers, all of which provide hosted Kubernetes as a service. People today use Dask on Kubernetes using either of the following:
 Helm: an easy way to stand up a longrunning Dask cluster and Jupyter notebook
 DaskKubernetes: for native Kubernetes integration for fast moving or ephemeral deployments.
For more information see Kubernetes
Is Dask secure?¶
Dask is deployed today within highly secure institutions, including major financial, healthcare, and government agencies.
That being said it’s worth noting that, by it’s very nature, Dask enables the execution of arbitrary user code on a large set of machines. Care should be taken to isolate, authenticate, and govern access to these machines. Fortunately, your institution likely already does this and uses standard technologies like SSL/TLS, Kerberos, and other systems with which Dask can integrate.
Do I need to purchase a new cluster?¶
No. It is easy to run Dask today on most clusters. If you have a preexisting HPC or Spark/Hadoop cluster then that will be fine to start running Dask.
You can start using Dask without any capital expenditure.
How do I manage users?¶
Dask doesn’t manage users, you likely have existing systems that do this well. In a large institutional setting we assume that you already have a resource manager like Yarn (Hadoop), Kubernetes, or PBS/SLURM/SGE/LSF/…, each of which have excellent user management capabilities, which are likely preferred by your IT department anyway.
Dask is designed to operate with userlevel permissions, which means that your data science users should be able to ask those systems mentioned above for resources, and have their processes tracked accordingly.
However, there are institutions where analystlevel users aren’t given direct access to the cluster. This is particularly common in Cloudera/Hortonworks Hadoop/Spark deployments. In these cases some level of explicit indirection may be required. For this, we recommend the Dask Gateway project, which uses ITlevel permissions to properly route authenticated users into secure resources.
How do I manage software environments?¶
This depends on your cluster resource manager:
 Most HPC users use their network file system
 Hadoop/Spark/Yarn users package their environment into a tarball and ship it around with HDFS (DaskYarn integrates with Conda Pack for this capability)
 Kubernetes or Cloud users use Docker images
In each case Dask integrates with existing processes and technologies that are well understood and familiar to the institution.
How does Dask communicate data between machines?¶
Dask usually communicates over TCP, using msgpack for small administrative messages, and its own protocol for efficiently passing around large data. The scheduler and each worker host their own TCP server, making Dask a distributed peertopeer network that uses pointtopoint communication. We do not use Sparkstyle shuffle systems. We do not use MPIstyle collectives. Everything is direct pointtopoint.
For high performance networks you can use either TCPoverInfiniband for about 1 GB/s bandwidth, or UCX (experimental) for full speed communication.
Are deployments long running, or ephemeral?¶
We see both, but ephemeral deployments are more common.
Most Dask use today is about enabling data science or data engineering users to scale their interactive workloads across the cluster. These are typically either interactive sessions with Jupyter, or batch scripts that run at a predefined time. In both cases, the user asks the resource manager for a bunch of machines, does some work, and then gives up those machines.
Some institutions also use Dask in an alwayson fashion, either handling realtime traffic in a scalable way, or responding to a broad set of interactive users with large datasets that it keeps resident in memory.
For Technical Leads¶
Will Dask “just work” on our existing code?¶
No, you will need to make modifications, but these modifications are usually small.
The vast majority of lines of business logic within your institution will not have to change, assuming that they are in Python and use tooling like Numpy, Pandas and ScikitLearn.
How well does Dask scale? What are Dask’s limitations?¶
The largest Dask deployments that we see today are on around 1000 multicore machines, perhaps 20,000 cores in total, but these are rare. Most institutionallevel problems (1100 TB) are well solved by deployments of 1050 nodes.
Technically, the backoftheenvelope number to keep in mind is that each task (an individual Python function call) in Dask has an overhead of around 200 microseconds. So if these tasks take 1 second each, then Dask can saturate around 5000 cores before scheduling overhead dominates costs. As workloads reach this limit they are encouraged to use larger chunk sizes to compensate. The vast majority of institutional users though do not reach this limit. For more information you may want to peruse our best practices
Is Dask resilient? What happens when a machine goes down?¶
Yes, Dask is resilient to the failure of worker nodes. It knows how it came to any result, and can replay the necessary work on other machines if one goes down.
If Dask’s centralized scheduler goes down then you would need to resubmit the computation. This is a fairly standard level of resiliency today, shared with other tooling like Apache Spark, Flink, and others.
The resource managers that host Dask, like Yarn or Kubernetes, typically provide longterm 24/7 resilience for alwayson operation.
Is the API exactly the same as NumPy/Pandas/ScikitLearn?¶
No, but it’s very close. That being said your data scientists will still have to learn some things.
What we find is that the Numpy/Pandas/ScikitLearn APIs aren’t the challenge when institutions adopt Dask. When API inconsistencies do exist, even modestly skilled programmers are able to understand why and work around them without much pain.
Instead, the challenge is building intuition around parallel performance. We’ve all built up a mental model for what is fast and slow on a single machine. This model changes when we factor in network communication and parallel algorithms, and the performance that we get for familiar operations can be surprising.
Our main solution to build this intuition, other than accumulated experience, is Dask’s Diagnostic Dashboard. The dashboard delivers a ton of visual feedback to users as they are running their computation to help them understand what is going on. This both helps them to identify and resolve immediate bottlenecks, and also builds up that parallel performance intuition surprisingly quickly.
How much performance tuning does Dask require?¶
Some other systems are notoriously hard to tune for optimal performance. What is Dask’s story here? How many knobs are there that we need to be aware of?
Like the rest of the Python software tools, Dask puts a lot of effort into having sane defaults. Dask workers automatically detect available memory and cores, and choose sensible defaults that are decent in most situations. Dask algorithms similarly provide decent choices by default, and informative warnings when tricky situations arise, so that, in common cases, things should be fine.
The most common knobs to tune include the following:
 The thread/process mixture to deal with GILholding computations (which are rare in Numpy/Pandas/ScikitLearn workflows)
 Partition size, like if should you have 100 MB chunks or 1 GB chunks
That being said, almost no institution’s needs are met entirely by the common case, and given the variety of problems that people throw at Dask, exceptional problems are commonplace. In these cases we recommend watching the dashboard during execution to see what is going on. It can commonly inform you what’s going wrong, so that you can make changes to your system.
What Data formats does Dask support?¶
Because Dask builds on NumPy and Pandas, it supports most formats that they support, which is most formats. That being said, not all formats are well suited for parallel access. In general people using the following formats are usually pretty happy:
 Tabular: Parquet, ORC, CSV, Line Delimited JSON, Avro, text
 Arrays: HDF5, NetCDF, Zarr, GRIB
More generally, if you have a Python function that turns a chunk of your stored data into a Pandas dataframe or Numpy array then Dask can probably call that function many times without much effort.
For groups looking for advice on which formats to use, we recommend Parquet for tables and Zarr or HDF5 for arrays.
Does Dask have a SQL interface?¶
No. Dask provides no SQL support. Dask dataframe looks like and uses Pandas for these sorts of operations. It would be great to see someone build a SQL interface on top of Pandas, which Dask could then use, but this is out of scope for the core Dask project itself.
As with Pandas though, we do support a dask.dataframe.from_sql
command for
efficiently pulling data out of SQL databases for Pandas computations.
User Interfaces¶
Dask supports several user interfaces:
 HighLevel
 Arrays: Parallel NumPy
 Bags: Parallel lists
 DataFrames: Parallel Pandas
 Machine Learning : Parallel ScikitLearn
 Others from external projects, like XArray
Each of these user interfaces employs the same underlying parallel computing machinery, and so has the same scaling, diagnostics, resilience, and so on, but each provides a different set of parallel algorithms and programming style.
This document helps you to decide which user interface best suits your needs, and gives some general information that applies to all interfaces. The pages linked above give more information about each interface in greater depth.
HighLevel Collections¶
Many people who start using Dask are explicitly looking for a scalable version of NumPy, Pandas, or ScikitLearn. For these situations, the starting point within Dask is usually fairly clear. If you want scalable NumPy arrays, then start with Dask array; if you want scalable Pandas DataFrames, then start with Dask DataFrame, and so on.
These highlevel interfaces copy the standard interface with slight variations. These interfaces automatically parallelize over larger datasets for you for a large subset of the API from the original project.
# Arrays
import dask.array as da
x = da.random.uniform(low=0, high=10, size=(10000, 10000), # normal numpy code
chunks=(1000, 1000)) # break into chunks of size 1000x1000
y = x + x.T  x.mean(axis=0) # Use normal syntax for high level algorithms
# DataFrames
import dask.dataframe as dd
df = dd.read_csv('2018**.csv', parse_dates='timestamp', # normal Pandas code
blocksize=64000000) # break text into 64MB chunks
s = df.groupby('name').balance.mean() # Use normal syntax for high level algorithms
# Bags / lists
import dask.bag as db
b = db.read_text('*.json').map(json.loads)
total = (b.filter(lambda d: d['name'] == 'Alice')
.map(lambda d: d['balance'])
.sum())
It is important to remember that, while APIs may be similar, some differences do exist. Additionally, the performance of some algorithms may differ from their inmemory counterparts due to the advantages and disadvantages of parallel programming. Some thought and attention is still required when using Dask.
LowLevel Interfaces¶
Often when parallelizing existing code bases or building custom algorithms, you run into code that is parallelizable, but isn’t just a big DataFrame or array. Consider the forloopy code below:
results = []
for a in A:
for b in B:
if a < b:
c = f(a, b)
else:
c = g(a, b)
results.append(c)
There is potential parallelism in this code (the many calls to f
and g
can be done in parallel), but it’s not clear how to rewrite it into a big
array or DataFrame so that it can use a higherlevel API. Even if you could
rewrite it into one of these paradigms, it’s not clear that this would be a
good idea. Much of the meaning would likely be lost in translation, and this
process would become much more difficult for more complex systems.
Instead, Dask’s lowerlevel APIs let you write parallel code one function call at a time within the context of your existing for loops. A common solution here is to use Dask delayed to wrap individual function calls into a lazily constructed task graph:
import dask
lazy_results = []
for a in A:
for b in B:
if a < b:
c = dask.delayed(f)(a, b) # add lazy task
else:
c = dask.delayed(g)(a, b) # add lazy task
lazy_results.append(c)
results = dask.compute(*lazy_results) # compute all in parallel
Combining High and LowLevel Interfaces¶
It is common to combine high and lowlevel interfaces. For example, you might use Dask array/bag/dataframe to load in data and do initial preprocessing, then switch to Dask delayed for a custom algorithm that is specific to your domain, then switch back to Dask array/dataframe to clean up and store results. Understanding both sets of user interfaces, and how to switch between them, can be a productive combination.
# Convert to a list of delayed Pandas dataframes
delayed_values = df.to_delayed()
# Manipulate delayed values arbitrarily as you like
# Convert many delayed Pandas DataFrames back to a single Dask DataFrame
df = dd.from_delayed(delayed_values)
Laziness and Computing¶
Most Dask user interfaces are lazy, meaning that they do not evaluate until
you explicitly ask for a result using the compute
method:
# This array syntax doesn't cause computation
y = x + x.T  x.mean(axis=0)
# Trigger computation by explicitly calling the compute method
y = y.compute()
If you have multiple results that you want to compute at the same time, use the
dask.compute
function. This can share intermediate results and so be more
efficient:
# compute multiple results at the same time with the compute function
min, max = dask.compute(y.min(), y.max())
Note that the compute()
function returns inmemory results. It converts
Dask DataFrames to Pandas DataFrames, Dask arrays to NumPy arrays, and Dask
bags to lists. You should only call compute on results that will fit
comfortably in memory. If your result does not fit in memory, then you might
consider writing it to disk instead.
# Write larger results out to disk rather than store them in memory
my_dask_dataframe.to_parquet('myfile.parquet')
my_dask_array.to_hdf5('myfile.hdf5')
my_dask_bag.to_textfiles('myfile.*.txt')
Persist into Distributed Memory¶
Alternatively, if you are on a cluster, then you may want to trigger a
computation and store the results in distributed memory. In this case you do
not want to call compute
, which would create a single Pandas, NumPy, or
list result. Instead, you want to call persist
, which returns a new Dask
object that points to actively computing, or already computed results spread
around your cluster’s memory.
# Compute returns an inmemory nonDask object
y = y.compute()
# Persist returns an inmemory Dask object that uses distributed storage if available
y = y.persist()
This is common to see after data loading an preprocessing steps, but before rapid iteration, exploration, or complex algorithms. For example, we might read in a lot of data, filter down to a more manageable subset, and then persist data into memory so that we can iterate quickly.
import dask.dataframe as dd
df = dd.read_parquet('...')
df = df[df.name == 'Alice'] # select important subset of data
df = df.persist() # trigger computation in the background
# These are all relatively fast now that the relevant data is in memory
df.groupby(df.id).balance.sum().compute() # explore data quickly
df.groupby(df.id).balance.mean().compute() # explore data quickly
df.id.nunique() # explore data quickly
Lazy vs Immediate¶
As mentioned above, most Dask workloads are lazy, that is, they don’t start any
work until you explicitly trigger them with a call to compute()
.
However, sometimes you do want to submit work as quickly as possible, track it
over time, submit new work or cancel work depending on partial results, and so
on. This can be useful when tracking or responding to realtime events,
handling streaming data, or when building complex and adaptive algorithms.
For these situations, people typically turn to the futures interface which is a lowlevel interface like Dask delayed, but operates immediately rather than lazily.
Here is the same example with Dask delayed and Dask futures to illustrate the difference.
Delayed: Lazy¶
@dask.delayed
def inc(x):
return x + 1
@dask.delayed
def add(x, y):
return x + y
a = inc(1) # no work has happened yet
b = inc(2) # no work has happened yet
c = add(a, b) # no work has happened yet
c = c.compute() # This triggers all of the above computations
Futures: Immediate¶
from dask.distributed import Client
client = Client()
def inc(x):
return x + 1
def add(x, y):
return x + y
a = client.submit(inc, 1) # work starts immediately
b = client.submit(inc, 2) # work starts immediately
c = client.submit(add, a, b) # work starts immediately
c = c.result() # block until work finishes, then gather result
You can also trigger work with the highlevel collections using the
persist
function. This will cause work to happen in the background when
using the distributed scheduler.
Combining Interfaces¶
There are established ways to combine the interfaces above:
The highlevel interfaces (array, bag, dataframe) have a
to_delayed
method that can convert to a sequence (or grid) of Dask delayed objectsdelayeds = df.to_delayed()
The highlevel interfaces (array, bag, dataframe) have a
from_delayed
method that can convert from either Delayed or Future objectsdf = dd.from_delayed(delayeds) df = dd.from_delayed(futures)
The
Client.compute
method converts Delayed objects into Futuresfutures = client.compute(delayeds)
The
dask.distributed.futures_of
function gathers futures from persisted collectionsfrom dask.distributed import futures_of df = df.persist() # start computation in the background futures = futures_of(df)
The Dask.delayed object converts Futures into delayed objects
delayed_value = dask.delayed(future)
The approaches above should suffice to convert any interface into any other. We often see some antipatterns that do not work as well:
 Calling lowlevel APIs (delayed or futures) on highlevel objects (like
Dask arrays or DataFrames). This downgrades those objects to their NumPy or
Pandas equivalents, which may not be desired.
Often people are looking for APIs like
dask.array.map_blocks
ordask.dataframe.map_partitions
instead.  Calling
compute()
on Future objects. Often people want the.result()
method instead.  Calling NumPy/Pandas functions on highlevel Dask objects or highlevel Dask functions on NumPy/Pandas objects
Conclusion¶
Most people who use Dask start with only one of the interfaces above but eventually learn how to use a few interfaces together. This helps them leverage the sophisticated algorithms in the highlevel interfaces while also working around tricky problems with the lowlevel interfaces.
For more information, see the documentation for the particular user interfaces below:
 High Level
 Arrays: Parallel NumPy
 Bags: Parallel lists
 DataFrames: Parallel Pandas
 Machine Learning : Parallel ScikitLearn
 Others from external projects, like XArray
Array¶
API¶
Top level user functions:
all (a[, axis, out, keepdims]) 
Test whether all array elements along a given axis evaluate to True. 
allclose (arr1, arr2[, rtol, atol, equal_nan]) 
Returns True if two arrays are elementwise equal within a tolerance. 
angle (x[, deg]) 
Return the angle of the complex argument. 
any (a[, axis, out, keepdims]) 
Test whether any array element along a given axis evaluates to True. 
apply_along_axis (func1d, axis, arr, *args[, …]) 
Apply a function to 1D slices along the given axis. 
apply_over_axes (func, a, axes) 
Apply a function repeatedly over multiple axes. 
arange (*args, **kwargs) 
Return evenly spaced values from start to stop with step size step. 
arccos (x, /[, out, where, casting, order, …]) 
Trigonometric inverse cosine, elementwise. 
arccosh (x, /[, out, where, casting, order, …]) 
Inverse hyperbolic cosine, elementwise. 
arcsin (x, /[, out, where, casting, order, …]) 
Inverse sine, elementwise. 
arcsinh (x, /[, out, where, casting, order, …]) 
Inverse hyperbolic sine elementwise. 
arctan (x, /[, out, where, casting, order, …]) 
Trigonometric inverse tangent, elementwise. 
arctan2 (x1, x2, /[, out, where, casting, …]) 
Elementwise arc tangent of x1/x2 choosing the quadrant correctly. 
arctanh (x, /[, out, where, casting, order, …]) 
Inverse hyperbolic tangent elementwise. 
argmax (a[, axis, out]) 
Returns the indices of the maximum values along an axis. 
argmin (a[, axis, out]) 
Returns the indices of the minimum values along an axis. 
argtopk (a, k[, axis, split_every]) 
Extract the indices of the k largest elements from a on the given axis, and return them sorted from largest to smallest. 
argwhere (a) 
Find the indices of array elements that are nonzero, grouped by element. 
around (x[, decimals]) 
Evenly round to the given number of decimals. 
array (object[, dtype, copy, order, subok, ndmin]) 
This docstring was copied from numpy.array. 
asanyarray (a) 
Convert the input to a dask array. 
asarray (a, **kwargs) 
Convert the input to a dask array. 
atleast_1d (*arys) 
Convert inputs to arrays with at least one dimension. 
atleast_2d (*arys) 
View inputs as arrays with at least two dimensions. 
atleast_3d (*arys) 
View inputs as arrays with at least three dimensions. 
average (a[, axis, weights, returned]) 
Compute the weighted average along the specified axis. 
bincount (x[, weights, minlength]) 
This docstring was copied from numpy.bincount. 
bitwise_and (x1, x2, /[, out, where, …]) 
Compute the bitwise AND of two arrays elementwise. 
bitwise_not (x, /[, out, where, casting, …]) 
Compute bitwise inversion, or bitwise NOT, elementwise. 
bitwise_or (x1, x2, /[, out, where, casting, …]) 
Compute the bitwise OR of two arrays elementwise. 
bitwise_xor (x1, x2, /[, out, where, …]) 
Compute the bitwise XOR of two arrays elementwise. 
block (arrays[, allow_unknown_chunksizes]) 
Assemble an ndarray from nested lists of blocks. 
blockwise (func, out_ind, *args[, name, …]) 
Tensor operation: Generalized inner and outer products 
broadcast_arrays (*args, **kwargs) 
Broadcast any number of arrays against each other. 
broadcast_to (x, shape[, chunks]) 
Broadcast an array to a new shape. 
coarsen (reduction, x, axes[, trim_excess]) 
Coarsen array by applying reduction to fixed size neighborhoods 
ceil (x, /[, out, where, casting, order, …]) 
Return the ceiling of the input, elementwise. 
choose (a, choices) 
Construct an array from an index array and a set of arrays to choose from. 
clip (*args, **kwargs) 
Clip (limit) the values in an array. 
compress (condition, a[, axis]) 
Return selected slices of an array along given axis. 
concatenate (seq[, axis, …]) 
Concatenate arrays along an existing axis 
conj (x, /[, out, where, casting, order, …]) 
Return the complex conjugate, elementwise. 
copysign (x1, x2, /[, out, where, casting, …]) 
Change the sign of x1 to that of x2, elementwise. 
corrcoef (x[, y, rowvar]) 
Return Pearson productmoment correlation coefficients. 
cos (x, /[, out, where, casting, order, …]) 
Cosine elementwise. 
cosh (x, /[, out, where, casting, order, …]) 
Hyperbolic cosine, elementwise. 
count_nonzero (a[, axis]) 
Counts the number of nonzero values in the array a . 
cov (m[, y, rowvar, bias, ddof]) 
Estimate a covariance matrix, given data and weights. 
cumprod (a[, axis, dtype, out]) 
Return the cumulative product of elements along a given axis. 
cumsum (a[, axis, dtype, out]) 
Return the cumulative sum of the elements along a given axis. 
deg2rad (x, /[, out, where, casting, order, …]) 
Convert angles from degrees to radians. 
degrees (x, /[, out, where, casting, order, …]) 
Convert angles from radians to degrees. 
diag (v) 
Extract a diagonal or construct a diagonal array. 
diagonal (a[, offset, axis1, axis2]) 
Return specified diagonals. 
diff (a[, n, axis]) 
Calculate the nth discrete difference along the given axis. 
divmod (x1, x2[, out1, out2], / [[, out, …]) 
Return elementwise quotient and remainder simultaneously. 
digitize (a, bins[, right]) 
Return the indices of the bins to which each value in input array belongs. 
dot (a, b[, out]) 
This docstring was copied from numpy.dot. 
dstack (tup[, allow_unknown_chunksizes]) 
Stack arrays in sequence depth wise (along third axis). 
ediff1d (ary[, to_end, to_begin]) 
The differences between consecutive elements of an array. 
einsum (subscripts, *operands[, out, dtype, …]) 
This docstring was copied from numpy.einsum. 
empty (*args, **kwargs) 
Blocked variant of empty 
empty_like (a[, dtype, chunks]) 
Return a new array with the same shape and type as a given array. 
exp (x, /[, out, where, casting, order, …]) 
Calculate the exponential of all elements in the input array. 
expm1 (x, /[, out, where, casting, order, …]) 
Calculate exp(x)  1 for all elements in the array. 
eye (N[, chunks, M, k, dtype]) 
Return a 2D Array with ones on the diagonal and zeros elsewhere. 
fabs (x, /[, out, where, casting, order, …]) 
Compute the absolute values elementwise. 
fix (*args, **kwargs) 
Round to nearest integer towards zero. 
flatnonzero (a) 
Return indices that are nonzero in the flattened version of a. 
flip (m, axis) 
Reverse element order along axis. 
flipud (m) 
Flip array in the up/down direction. 
fliplr (m) 
Flip array in the left/right direction. 
floor (x, /[, out, where, casting, order, …]) 
Return the floor of the input, elementwise. 
fmax (x1, x2, /[, out, where, casting, …]) 
Elementwise maximum of array elements. 
fmin (x1, x2, /[, out, where, casting, …]) 
Elementwise minimum of array elements. 
fmod (x1, x2, /[, out, where, casting, …]) 
Return the elementwise remainder of division. 
frexp (x[, out1, out2], / [[, out, where, …]) 
Decompose the elements of x into mantissa and twos exponent. 
fromfunction (func[, chunks, shape, dtype]) 
Construct an array by executing a function over each coordinate. 
frompyfunc (func, nin, nout) 
This docstring was copied from numpy.frompyfunc. 
full (*args, **kwargs) 
Blocked variant of full 
full_like (a, fill_value[, dtype, chunks]) 
Return a full array with the same shape and type as a given array. 
gradient (f, *varargs, **kwargs) 
Return the gradient of an Ndimensional array. 
histogram (a[, bins, range, normed, weights, …]) 
Blocked variant of numpy.histogram() . 
hstack (tup[, allow_unknown_chunksizes]) 
Stack arrays in sequence horizontally (column wise). 
hypot (x1, x2, /[, out, where, casting, …]) 
Given the “legs” of a right triangle, return its hypotenuse. 
imag (*args, **kwargs) 
Return the imaginary part of the complex argument. 
indices (dimensions[, dtype, chunks]) 
Implements NumPy’s indices for Dask Arrays. 
insert (arr, obj, values, axis) 
Insert values along the given axis before the given indices. 
invert (x, /[, out, where, casting, order, …]) 
Compute bitwise inversion, or bitwise NOT, elementwise. 
isclose (arr1, arr2[, rtol, atol, equal_nan]) 
Returns a boolean array where two arrays are elementwise equal within a tolerance. 
iscomplex (*args, **kwargs) 
Returns a bool array, where True if input element is complex. 
isfinite (x, /[, out, where, casting, order, …]) 
Test elementwise for finiteness (not infinity or not Not a Number). 
isin (element, test_elements[, …]) 
Calculates element in test_elements, broadcasting over element only. 
isinf (x, /[, out, where, casting, order, …]) 
Test elementwise for positive or negative infinity. 
isneginf 
Return (x1 == x2) elementwise. 
isnan (x, /[, out, where, casting, order, …]) 
Test elementwise for NaN and return result as a boolean array. 
isnull (values) 
pandas.isnull for dask arrays 
isposinf 
Return (x1 == x2) elementwise. 
isreal (*args, **kwargs) 
Returns a bool array, where True if input element is real. 
ldexp (x1, x2, /[, out, where, casting, …]) 
Returns x1 * 2**x2, elementwise. 
linspace (start, stop[, num, endpoint, …]) 
Return num evenly spaced values over the closed interval [start, stop]. 
log (x, /[, out, where, casting, order, …]) 
Natural logarithm, elementwise. 
log10 (x, /[, out, where, casting, order, …]) 
Return the base 10 logarithm of the input array, elementwise. 
log1p (x, /[, out, where, casting, order, …]) 
Return the natural logarithm of one plus the input array, elementwise. 
log2 (x, /[, out, where, casting, order, …]) 
Base2 logarithm of x. 
logaddexp (x1, x2, /[, out, where, casting, …]) 
Logarithm of the sum of exponentiations of the inputs. 
logaddexp2 (x1, x2, /[, out, where, casting, …]) 
Logarithm of the sum of exponentiations of the inputs in base2. 
logical_and (x1, x2, /[, out, where, …]) 
Compute the truth value of x1 AND x2 elementwise. 
logical_not (x, /[, out, where, casting, …]) 
Compute the truth value of NOT x elementwise. 
logical_or (x1, x2, /[, out, where, casting, …]) 
Compute the truth value of x1 OR x2 elementwise. 
logical_xor (x1, x2, /[, out, where, …]) 
Compute the truth value of x1 XOR x2, elementwise. 
map_overlap (x, func, depth[, boundary, trim]) 
Map a function over blocks of the array with some overlap 
map_blocks (func, *args[, name, token, …]) 
Map a function across all blocks of a dask array. 
matmul (x1, x2, /[, out, casting, order, …]) 
This docstring was copied from numpy.matmul. 
max (a[, axis, out, keepdims, initial]) 
Return the maximum of an array or maximum along an axis. 
maximum (x1, x2, /[, out, where, casting, …]) 
Elementwise maximum of array elements. 
mean (a[, axis, dtype, out, keepdims]) 
Compute the arithmetic mean along the specified axis. 
meshgrid (*xi, **kwargs) 
Return coordinate matrices from coordinate vectors. 
min (a[, axis, out, keepdims, initial]) 
Return the minimum of an array or minimum along an axis. 
minimum (x1, x2, /[, out, where, casting, …]) 
Elementwise minimum of array elements. 
modf (x[, out1, out2], / [[, out, where, …]) 
Return the fractional and integral parts of an array, elementwise. 
moment (a, order[, axis, dtype, keepdims, …]) 

moveaxis (a, source, destination) 
Move axes of an array to new positions. 
nanargmax (x, axis, **kwargs) 

nanargmin (x, axis, **kwargs) 

nancumprod (a[, axis, dtype, out]) 
Return the cumulative product of array elements over a given axis treating Not a Numbers (NaNs) as one. 
nancumsum (a[, axis, dtype, out]) 
Return the cumulative sum of array elements over a given axis treating Not a Numbers (NaNs) as zero. 
nanmax (a[, axis, out, keepdims]) 
Return the maximum of an array or maximum along an axis, ignoring any NaNs. 
nanmean (a[, axis, dtype, out, keepdims]) 
Compute the arithmetic mean along the specified axis, ignoring NaNs. 
nanmin (a[, axis, out, keepdims]) 
Return minimum of an array or minimum along an axis, ignoring any NaNs. 
nanprod (a[, axis, dtype, out, keepdims]) 
Return the product of array elements over a given axis treating Not a Numbers (NaNs) as ones. 
nanstd (a[, axis, dtype, out, ddof, keepdims]) 
Compute the standard deviation along the specified axis, while ignoring NaNs. 
nansum (a[, axis, dtype, out, keepdims]) 
Return the sum of array elements over a given axis treating Not a Numbers (NaNs) as zero. 
nanvar (a[, axis, dtype, out, ddof, keepdims]) 
Compute the variance along the specified axis, while ignoring NaNs. 
nan_to_num (*args, **kwargs) 
Replace NaN with zero and infinity with large finite numbers. 
nextafter (x1, x2, /[, out, where, casting, …]) 
Return the next floatingpoint value after x1 towards x2, elementwise. 
nonzero (a) 
Return the indices of the elements that are nonzero. 
notnull (values) 
pandas.notnull for dask arrays 
ones (*args, **kwargs) 
Blocked variant of ones 
ones_like (a[, dtype, chunks]) 
Return an array of ones with the same shape and type as a given array. 
outer (a, b) 
Compute the outer product of two vectors. 
pad (array, pad_width, mode, **kwargs) 
Pads an array. 
percentile (a, q[, interpolation, method]) 
Approximate percentile of 1D array 
PerformanceWarning 
A warning given when bad chunking may cause poor performance 
piecewise (x, condlist, funclist, *args, **kw) 
Evaluate a piecewisedefined function. 
prod (a[, axis, dtype, out, keepdims, initial]) 
Return the product of array elements over a given axis. 
ptp (a[, axis]) 
Range of values (maximum  minimum) along an axis. 
rad2deg (x, /[, out, where, casting, order, …]) 
Convert angles from radians to degrees. 
radians (x, /[, out, where, casting, order, …]) 
Convert angles from degrees to radians. 
ravel (array) 
Return a contiguous flattened array. 
real (*args, **kwargs) 
Return the real part of the complex argument. 
rechunk (x[, chunks, threshold, block_size_limit]) 
Convert blocks in dask array x for new chunks. 
reduction (x, chunk, aggregate[, axis, …]) 
General version of reductions 
repeat (a, repeats[, axis]) 
Repeat elements of an array. 
reshape (x, shape) 
Reshape array to new shape 
result_type (*arrays_and_dtypes) 
This docstring was copied from numpy.result_type. 
rint (x, /[, out, where, casting, order, …]) 
Round elements of the array to the nearest integer. 
roll (array, shift[, axis]) 
Roll array elements along a given axis. 
rollaxis (a, axis[, start]) 

round (a[, decimals]) 
Round an array to the given number of decimals. 
sign (x, /[, out, where, casting, order, …]) 
Returns an elementwise indication of the sign of a number. 
signbit (x, /[, out, where, casting, order, …]) 
Returns elementwise True where signbit is set (less than zero). 
sin (x, /[, out, where, casting, order, …]) 
Trigonometric sine, elementwise. 
sinh (x, /[, out, where, casting, order, …]) 
Hyperbolic sine, elementwise. 
sqrt (x, /[, out, where, casting, order, …]) 
Return the nonnegative squareroot of an array, elementwise. 
square (x, /[, out, where, casting, order, …]) 
Return the elementwise square of the input. 
squeeze (a[, axis]) 
Remove singledimensional entries from the shape of an array. 
stack (seq[, axis]) 
Stack arrays along a new axis 
std (a[, axis, dtype, out, ddof, keepdims]) 
Compute the standard deviation along the specified axis. 
sum (a[, axis, dtype, out, keepdims, initial]) 
Sum of array elements over a given axis. 
take (a, indices[, axis]) 
Take elements from an array along an axis. 
tan (x, /[, out, where, casting, order, …]) 
Compute tangent elementwise. 
tanh (x, /[, out, where, casting, order, …]) 
Compute hyperbolic tangent elementwise. 
tensordot (lhs, rhs[, axes]) 
Compute tensor dot product along specified axes for arrays >= 1D. 
tile (A, reps) 
Construct an array by repeating A the number of times given by reps. 
topk (a, k[, axis, split_every]) 
Extract the k largest elements from a on the given axis, and return them sorted from largest to smallest. 
trace (a[, offset, axis1, axis2, dtype, out]) 
Return the sum along diagonals of the array. 
transpose (a[, axes]) 
Permute the dimensions of an array. 
tril (m[, k]) 
Lower triangle of an array with elements above the kth diagonal zeroed. 
triu (m[, k]) 
Upper triangle of an array with elements above the kth diagonal zeroed. 
trunc (x, /[, out, where, casting, order, …]) 
Return the truncated value of the input, elementwise. 
unify_chunks (*args, **kwargs) 
Unify chunks across a sequence of arrays 
unique (ar[, return_index, return_inverse, …]) 
Find the unique elements of an array. 
unravel_index (indices, shape[, order]) 
This docstring was copied from numpy.unravel_index. 
var (a[, axis, dtype, out, ddof, keepdims]) 
Compute the variance along the specified axis. 
vdot (a, b) 
This docstring was copied from numpy.vdot. 
vstack (tup[, allow_unknown_chunksizes]) 
Stack arrays in sequence vertically (row wise). 
where (condition, [x, y]) 
This docstring was copied from numpy.where. 
zeros (*args, **kwargs) 
Blocked variant of zeros 
zeros_like (a[, dtype, chunks]) 
Return an array of zeros with the same shape and type as a given array. 
Fast Fourier Transforms¶
fft.fft_wrap (fft_func[, kind, dtype]) 
Wrap 1D, 2D, and ND real and complex FFT functions 
fft.fft (a[, n, axis]) 
Wrapping of numpy.fft.fft 
fft.fft2 (a[, s, axes]) 
Wrapping of numpy.fft.fft2 
fft.fftn (a[, s, axes]) 
Wrapping of numpy.fft.fftn 
fft.ifft (a[, n, axis]) 
Wrapping of numpy.fft.ifft 
fft.ifft2 (a[, s, axes]) 
Wrapping of numpy.fft.ifft2 
fft.ifftn (a[, s, axes]) 
Wrapping of numpy.fft.ifftn 
fft.rfft (a[, n, axis]) 
Wrapping of numpy.fft.rfft 
fft.rfft2 (a[, s, axes]) 
Wrapping of numpy.fft.rfft2 
fft.rfftn (a[, s, axes]) 
Wrapping of numpy.fft.rfftn 
fft.irfft (a[, n, axis]) 
Wrapping of numpy.fft.irfft 
fft.irfft2 (a[, s, axes]) 
Wrapping of numpy.fft.irfft2 
fft.irfftn (a[, s, axes]) 
Wrapping of numpy.fft.irfftn 
fft.hfft (a[, n, axis]) 
Wrapping of numpy.fft.hfft 
fft.ihfft (a[, n, axis]) 
Wrapping of numpy.fft.ihfft 
fft.fftfreq (n[, d, chunks]) 
Return the Discrete Fourier Transform sample frequencies. 
fft.rfftfreq (n[, d, chunks]) 
Return the Discrete Fourier Transform sample frequencies (for usage with rfft, irfft). 
fft.fftshift (x[, axes]) 
Shift the zerofrequency component to the center of the spectrum. 
fft.ifftshift (x[, axes]) 
The inverse of fftshift. 
Linear Algebra¶
linalg.cholesky (a[, lower]) 
Returns the Cholesky decomposition, \(A = L L^*\) or \(A = U^* U\) of a Hermitian positivedefinite matrix A. 
linalg.inv (a) 
Compute the inverse of a matrix with LU decomposition and forward / backward substitutions. 
linalg.lstsq (a, b) 
Return the leastsquares solution to a linear matrix equation using QR decomposition. 
linalg.lu (a) 
Compute the lu decomposition of a matrix. 
linalg.norm (x[, ord, axis, keepdims]) 
Matrix or vector norm. 
linalg.qr (a) 
Compute the qr factorization of a matrix. 
linalg.solve (a, b[, sym_pos]) 
Solve the equation a x = b for x . 
linalg.solve_triangular (a, b[, lower]) 
Solve the equation a x = b for x, assuming a is a triangular matrix. 
linalg.svd (a) 
Compute the singular value decomposition of a matrix. 
linalg.svd_compressed (a, k[, n_power_iter, …]) 
Randomly compressed rankk thin Singular Value Decomposition. 
linalg.sfqr (data[, name]) 
Direct ShortandFat QR 
linalg.tsqr (data[, compute_svd, …]) 
Direct TallandSkinny QR algorithm 
Masked Arrays¶
ma.average (a[, axis, weights, returned]) 
Return the weighted average of array over the given axis. 
ma.filled (a[, fill_value]) 
Return input as an array with masked data replaced by a fill value. 
ma.fix_invalid (a[, fill_value]) 
Return input with invalid data masked and replaced by a fill value. 
ma.getdata (a) 
Return the data of a masked array as an ndarray. 
ma.getmaskarray (a) 
Return the mask of a masked array, or full boolean array of False. 
ma.masked_array (data[, mask, fill_value]) 
An array class with possibly masked values. 
ma.masked_equal (a, value) 
Mask an array where equal to a given value. 
ma.masked_greater (x, value[, copy]) 
Mask an array where greater than a given value. 
ma.masked_greater_equal (x, value[, copy]) 
Mask an array where greater than or equal to a given value. 
ma.masked_inside (x, v1, v2) 
Mask an array inside a given interval. 
ma.masked_invalid (a) 
Mask an array where invalid values occur (NaNs or infs). 
ma.masked_less (x, value[, copy]) 
Mask an array where less than a given value. 
ma.masked_less_equal (x, value[, copy]) 
Mask an array where less than or equal to a given value. 
ma.masked_not_equal (x, value[, copy]) 
Mask an array where not equal to a given value. 
ma.masked_outside (x, v1, v2) 
Mask an array outside a given interval. 
ma.masked_values (x, value[, rtol, atol, shrink]) 
Mask using floating point equality. 
ma.masked_where (condition, a) 
Mask an array where a condition is met. 
ma.set_fill_value (a, fill_value) 
Set the filling value of a, if a is a masked array. 
Random¶
random.beta (a, b[, size]) 
Draw samples from a Beta distribution. 
random.binomial (n, p[, size]) 
Draw samples from a binomial distribution. 
random.chisquare (df[, size]) 
Draw samples from a chisquare distribution. 
random.choice (a[, size, replace, p]) 
Generates a random sample from a given 1D array 
random.exponential ([scale, size]) 
Draw samples from an exponential distribution. 
random.f (dfnum, dfden[, size]) 
Draw samples from an F distribution. 
random.gamma (shape[, scale, size]) 
Draw samples from a Gamma distribution. 
random.geometric (p[, size]) 
Draw samples from the geometric distribution. 
random.gumbel ([loc, scale, size]) 
Draw samples from a Gumbel distribution. 
random.hypergeometric (ngood, nbad, nsample) 
Draw samples from a Hypergeometric distribution. 
random.laplace ([loc, scale, size]) 
Draw samples from the Laplace or double exponential distribution with specified location (or mean) and scale (decay). 
random.logistic ([loc, scale, size]) 
Draw samples from a logistic distribution. 
random.lognormal ([mean, sigma, size]) 
Draw samples from a lognormal distribution. 
random.logseries (p[, size]) 
Draw samples from a logarithmic series distribution. 
random.negative_binomial (n, p[, size]) 
Draw samples from a negative binomial distribution. 
random.noncentral_chisquare (df, nonc[, size]) 
Draw samples from a noncentral chisquare distribution. 
random.noncentral_f (dfnum, dfden, nonc[, size]) 
Draw samples from the noncentral F distribution. 
random.normal ([loc, scale, size]) 
Draw random samples from a normal (Gaussian) distribution. 
random.pareto (a[, size]) 
Draw samples from a Pareto II or Lomax distribution with specified shape. 
random.permutation (x) 
Randomly permute a sequence, or return a permuted range. 
random.poisson ([lam, size]) 
Draw samples from a Poisson distribution. 
random.power (a[, size]) 
Draws samples in [0, 1] from a power distribution with positive exponent a  1. 
random.randint (low[, high, size, dtype]) 
Return random integers from low (inclusive) to high (exclusive). 
random.random ([size]) 
Return random floats in the halfopen interval [0.0, 1.0). 
random.random_sample ([size]) 
Return random floats in the halfopen interval [0.0, 1.0). 
random.rayleigh ([scale, size]) 
Draw samples from a Rayleigh distribution. 
random.standard_cauchy ([size]) 
Draw samples from a standard Cauchy distribution with mode = 0. 
random.standard_exponential ([size]) 
Draw samples from the standard exponential distribution. 
random.standard_gamma (shape[, size]) 
Draw samples from a standard Gamma distribution. 
random.standard_normal ([size]) 
Draw samples from a standard Normal distribution (mean=0, stdev=1). 
random.standard_t (df[, size]) 
Draw samples from a standard Student’s t distribution with df degrees of freedom. 
random.triangular (left, mode, right[, size]) 
Draw samples from the triangular distribution over the interval [left, right] . 
random.uniform ([low, high, size]) 
Draw samples from a uniform distribution. 
random.vonmises (mu, kappa[, size]) 
Draw samples from a von Mises distribution. 
random.wald (mean, scale[, size]) 
Draw samples from a Wald, or inverse Gaussian, distribution. 
random.weibull (a[, size]) 
Draw samples from a Weibull distribution. 
random.zipf (a[, size]) 
Standard distributions 
Stats¶
stats.ttest_ind (a, b[, axis, equal_var]) 
Calculate the Ttest for the means of two independent samples of scores. 
stats.ttest_1samp (a, popmean[, axis, nan_policy]) 
Calculate the Ttest for the mean of ONE group of scores. 
stats.ttest_rel (a, b[, axis, nan_policy]) 
Calculate the Ttest on TWO RELATED samples of scores, a and b. 
stats.chisquare (f_obs[, f_exp, ddof, axis]) 
Calculate a oneway chi square test. 
stats.power_divergence (f_obs[, f_exp, ddof, …]) 
CressieRead power divergence statistic and goodness of fit test. 
stats.skew (a[, axis, bias, nan_policy]) 
Compute the skewness of a data set. 
stats.skewtest (a[, axis, nan_policy]) 
Test whether the skew is different from the normal distribution. 
stats.kurtosis (a[, axis, fisher, bias, …]) 
Compute the kurtosis (Fisher or Pearson) of a dataset. 
stats.kurtosistest (a[, axis, nan_policy]) 
Test whether a dataset has normal kurtosis. 
stats.normaltest (a[, axis, nan_policy]) 
Test whether a sample differs from a normal distribution. 
stats.f_oneway (*args) 
Performs a 1way ANOVA. 
stats.moment (a[, moment, axis, nan_policy]) 
Calculate the nth moment about the mean for a sample. 
Image Support¶
image.imread (filename[, imread, preprocess]) 
Read a stack of images into a dask array 
Slightly Overlapping Computations¶
overlap.overlap (x, depth, boundary) 
Share boundaries between neighboring blocks 
overlap.map_overlap (x, func, depth[, …]) 
Map a function over blocks of the array with some overlap 
overlap.trim_internal (x, axes[, boundary]) 
Trim sides from each block 
overlap.trim_overlap (x, depth[, boundary]) 
Trim sides from each block. 
Create and Store Arrays¶
from_array (x[, chunks, name, lock, asarray, …]) 
Create dask array from something that looks like an array 
from_delayed (value, shape[, dtype, meta, name]) 
Create a dask array from a dask delayed value 
from_npy_stack (dirname[, mmap_mode]) 
Load dask array from stack of npy files 
from_zarr (url[, component, storage_options, …]) 
Load array from the zarr storage format 
from_tiledb (uri[, attribute, chunks, …]) 
Load array from the TileDB storage format 
store (sources, targets[, lock, regions, …]) 
Store dask arrays in arraylike objects, overwrite data in target 
to_hdf5 (filename, *args, **kwargs) 
Store arrays in HDF5 file 
to_zarr (arr, url[, component, …]) 
Save array to the zarr storage format 
to_npy_stack (dirname, x[, axis]) 
Write dask array to a stack of .npy files 
to_tiledb (darray, uri[, compute, …]) 
Save array to the TileDB storage format 
Generalized Ufuncs¶
apply_gufunc (func, signature, *args, **kwargs) 
Apply a generalized ufunc or similar python function to arrays. 
as_gufunc ([signature]) 
Decorator for dask.array.gufunc . 
gufunc (pyfunc, **kwargs) 
Binds pyfunc into dask.array.apply_gufunc when called. 
Internal functions¶
blockwise (func, out_ind, *args[, name, …]) 
Tensor operation: Generalized inner and outer products 
normalize_chunks (chunks[, shape, limit, …]) 
Normalize chunks to tuple of tuples 
Other functions¶

dask.array.
from_array
(x, chunks='auto', name=None, lock=False, asarray=None, fancy=True, getitem=None, meta=None)¶ Create dask array from something that looks like an array
Input must have a
.shape
,.ndim
,.dtype
and support numpystyle slicing.Parameters:  x : array_like
 chunks : int, tuple
How to chunk the array. Must be one of the following forms:  A blocksize like 1000.  A blockshape like (1000, 1000).  Explicit sizes of all blocks along all dimensions like
((1000, 1000, 500), (400, 400)).
 A size in bytes, like “100 MiB” which will choose a uniform blocklike shape
 The word “auto” which acts like the above, but uses a configuration
value
array.chunksize
for the chunk size
1 or None as a blocksize indicate the size of the corresponding dimension.
 name : str, optional
The key name to use for the array. Defaults to a hash of
x
. By default, hash uses python’s standard sha1. This behaviour can be changed by installing cityhash, xxhash or murmurhash. If installed, a largefactor speedup can be obtained in the tokenisation step. Usename=False
to generate a random name instead of hashing (fast) lock : bool or Lock, optional
If
x
doesn’t support concurrent reads then provide a lock here, or pass in True to have dask.array create one for you. asarray : bool, optional
If True then call np.asarray on chunks to convert them to numpy arrays. If False then chunks are passed through unchanged. If None (default) then we use True if the
__array_function__
method is undefined. fancy : bool, optional
If
x
doesn’t support fancy indexing (e.g. indexing with lists or arrays) then set to False. Default is True. meta : Arraylike, optional
The metadata for the resulting dask array. This is the kind of array that will result from slicing the input array. Defaults to the input array.
Examples
>>> x = h5py.File('...')['/data/path'] # doctest: +SKIP >>> a = da.from_array(x, chunks=(1000, 1000)) # doctest: +SKIP
If your underlying datastore does not support concurrent reads then include the
lock=True
keyword argument orlock=mylock
if you want multiple arrays to coordinate around the same lock.>>> a = da.from_array(x, chunks=(1000, 1000), lock=True) # doctest: +SKIP
If your underlying datastore has a
.chunks
attribute (as h5py and zarr datasets do) then a multiple of that chunk shape will be used if you do not provide a chunk shape.>>> a = da.from_array(x, chunks='auto') # doctest: +SKIP >>> a = da.from_array(x, chunks='100 MiB') # doctest: +SKIP >>> a = da.from_array(x) # doctest: +SKIP

dask.array.
from_delayed
(value, shape, dtype=None, meta=None, name=None)¶ Create a dask array from a dask delayed value
This routine is useful for constructing dask arrays in an adhoc fashion using dask delayed, particularly when combined with stack and concatenate.
The dask array will consist of a single chunk.
Examples
>>> import dask >>> import dask.array as da >>> value = dask.delayed(np.ones)(5) >>> array = da.from_delayed(value, (5,), dtype=float) >>> array dask.array<fromvalue, shape=(5,), dtype=float64, chunksize=(5,), chunktype=numpy.ndarray> >>> array.compute() array([1., 1., 1., 1., 1.])

dask.array.
store
(sources, targets, lock=True, regions=None, compute=True, return_stored=False, **kwargs)¶ Store dask arrays in arraylike objects, overwrite data in target
This stores dask arrays into object that supports numpystyle setitem indexing. It stores values chunk by chunk so that it does not have to fill up memory. For best performance you can align the block size of the storage target with the block size of your array.
If your data fits in memory then you may prefer calling
np.array(myarray)
instead.Parameters:  sources: Array or iterable of Arrays
 targets: arraylike or Delayed or iterable of arraylikes and/or Delayeds
These should support setitem syntax
target[10:20] = ...
 lock: boolean or threading.Lock, optional
Whether or not to lock the data stores while storing. Pass True (lock each file individually), False (don’t lock) or a particular
threading.Lock
object to be shared among all writes. regions: tuple of slices or list of tuples of slices
Each
region
tuple inregions
should be such thattarget[region].shape = source.shape
for the corresponding source and target in sources and targets, respectively. If this is a tuple, the contents will be assumed to be slices, so do not provide a tuple of tuples. compute: boolean, optional
If true compute immediately, return
dask.delayed.Delayed
otherwise return_stored: boolean, optional
Optionally return the stored result (default False).
Examples
>>> x = ... # doctest: +SKIP
>>> import h5py # doctest: +SKIP >>> f = h5py.File('myfile.hdf5', mode='a') # doctest: +SKIP >>> dset = f.create_dataset('/data', shape=x.shape, ... chunks=x.chunks, ... dtype='f8') # doctest: +SKIP
>>> store(x, dset) # doctest: +SKIP
Alternatively store many arrays at the same time
>>> store([x, y, z], [dset1, dset2, dset3]) # doctest: +SKIP

dask.array.
coarsen
(reduction, x, axes, trim_excess=False)¶ Coarsen array by applying reduction to fixed size neighborhoods
Parameters:  reduction: function
Function like np.sum, np.mean, etc…
 x: np.ndarray
Array to be coarsened
 axes: dict
Mapping of axis to coarsening factor
Examples
>>> x = np.array([1, 2, 3, 4, 5, 6]) >>> coarsen(np.sum, x, {0: 2}) array([ 3, 7, 11]) >>> coarsen(np.max, x, {0: 3}) array([3, 6])
Provide dictionary of scale per dimension
>>> x = np.arange(24).reshape((4, 6)) >>> x array([[ 0, 1, 2, 3, 4, 5], [ 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23]])
>>> coarsen(np.min, x, {0: 2, 1: 3}) array([[ 0, 3], [12, 15]])
You must avoid excess elements explicitly
>>> x = np.array([1, 2, 3, 4, 5, 6, 7, 8]) >>> coarsen(np.min, x, {0: 3}, trim_excess=True) array([1, 4])

dask.array.
stack
(seq, axis=0)¶ Stack arrays along a new axis
Given a sequence of dask arrays, form a new dask array by stacking them along a new dimension (axis=0 by default)
See also
Examples
Create slices
>>> import dask.array as da >>> import numpy as np
>>> data = [from_array(np.ones((4, 4)), chunks=(2, 2)) ... for i in range(3)]
>>> x = da.stack(data, axis=0) >>> x.shape (3, 4, 4)
>>> da.stack(data, axis=1).shape (4, 3, 4)
>>> da.stack(data, axis=1).shape (4, 4, 3)
Result is a new dask Array

dask.array.
concatenate
(seq, axis=0, allow_unknown_chunksizes=False)¶ Concatenate arrays along an existing axis
Given a sequence of dask Arrays form a new dask Array by stacking them along an existing dimension (axis=0 by default)
Parameters:  seq: list of dask.arrays
 axis: int
Dimension along which to align all of the arrays
 allow_unknown_chunksizes: bool
Allow unknown chunksizes, such as come from converting from dask dataframes. Dask.array is unable to verify that chunks line up. If data comes from differently aligned sources then this can cause unexpected results.
See also
Examples
Create slices
>>> import dask.array as da >>> import numpy as np
>>> data = [from_array(np.ones((4, 4)), chunks=(2, 2)) ... for i in range(3)]
>>> x = da.concatenate(data, axis=0) >>> x.shape (12, 4)
>>> da.concatenate(data, axis=1).shape (4, 12)
Result is a new dask Array

dask.array.
all
(a, axis=None, out=None, keepdims=<no value>)¶ Test whether all array elements along a given axis evaluate to True.
Parameters:  a : array_like
Input array or object that can be converted to an array.
 axis : None or int or tuple of ints, optional
Axis or axes along which a logical AND reduction is performed. The default (axis = None) is to perform a logical AND over all the dimensions of the input array. axis may be negative, in which case it counts from the last to the first axis.
New in version 1.7.0.
If this is a tuple of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before.
 out : ndarray, optional
Alternate output array in which to place the result. It must have the same shape as the expected output and its type is preserved (e.g., if
dtype(out)
is float, the result will consist of 0.0’s and 1.0’s). See doc.ufuncs (Section “Output arguments”) for more details. keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the all method of subclasses of ndarray, however any nondefault value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
Returns:  all : ndarray, bool
A new boolean or array is returned unless out is specified, in which case a reference to out is returned.
See also
ndarray.all
 equivalent method
any
 Test whether any element along a given axis evaluates to True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
>>> np.all([[True,False],[True,True]]) False
>>> np.all([[True,False],[True,True]], axis=0) array([ True, False])
>>> np.all([1, 4, 5]) True
>>> np.all([1.0, np.nan]) True
>>> o=np.array([False]) >>> z=np.all([1, 4, 5], out=o) >>> id(z), id(o), z # doctest: +SKIP (28293632, 28293632, array([ True]))

dask.array.
allclose
(arr1, arr2, rtol=1e05, atol=1e08, equal_nan=False)¶ Returns True if two arrays are elementwise equal within a tolerance.
This docstring was copied from numpy.allclose.
Some inconsistencies with the Dask version may exist.
The tolerance values are positive, typically very small numbers. The relative difference (rtol * abs(b)) and the absolute difference atol are added together to compare against the absolute difference between a and b.
If either array contains one or more NaNs, False is returned. Infs are treated as equal if they are in the same place and of the same sign in both arrays.
Parameters:  a, b : array_like
Input arrays to compare.
 rtol : float
The relative tolerance parameter (see Notes).
 atol : float
The absolute tolerance parameter (see Notes).
 equal_nan : bool
Whether to compare NaN’s as equal. If True, NaN’s in a will be considered equal to NaN’s in b in the output array.
New in version 1.10.0.
Returns:  allclose : bool
Returns True if the two arrays are equal within the given tolerance; False otherwise.
Notes
If the following equation is elementwise True, then allclose returns True.
absolute(a  b) <= (atol + rtol * absolute(b))The above equation is not symmetric in a and b, so that
allclose(a, b)
might be different fromallclose(b, a)
in some rare cases.The comparison of a and b uses standard broadcasting, which means that a and b need not have the same shape in order for
allclose(a, b)
to evaluate to True. The same is true for equal but not array_equal.Examples
>>> np.allclose([1e10,1e7], [1.00001e10,1e8]) # doctest: +SKIP False >>> np.allclose([1e10,1e8], [1.00001e10,1e9]) # doctest: +SKIP True >>> np.allclose([1e10,1e8], [1.0001e10,1e9]) # doctest: +SKIP False >>> np.allclose([1.0, np.nan], [1.0, np.nan]) # doctest: +SKIP False >>> np.allclose([1.0, np.nan], [1.0, np.nan], equal_nan=True) # doctest: +SKIP True

dask.array.
angle
(x, deg=0)¶ Return the angle of the complex argument.
Parameters:  z : array_like
A complex number or sequence of complex numbers.
 deg : bool, optional
Return angle in degrees if True, radians if False (default).
Returns:  angle : ndarray or scalar
The counterclockwise angle from the positive real axis on the complex plane, with dtype as numpy.float64.
 ..versionchanged:: 1.16.0
This function works on subclasses of ndarray like ma.array.
See also
arctan2
,absolute
Examples
>>> np.angle([1.0, 1.0j, 1+1j]) # in radians # doctest: +SKIP array([ 0. , 1.57079633, 0.78539816]) >>> np.angle(1+1j, deg=True) # in degrees # doctest: +SKIP 45.0

dask.array.
any
(a, axis=None, out=None, keepdims=<no value>)¶ Test whether any array element along a given axis evaluates to True.
Returns single boolean unless axis is not
None
Parameters:  a : array_like
Input array or object that can be converted to an array.
 axis : None or int or tuple of ints, optional
Axis or axes along which a logical OR reduction is performed. The default (axis = None) is to perform a logical OR over all the dimensions of the input array. axis may be negative, in which case it counts from the last to the first axis.
New in version 1.7.0.
If this is a tuple of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before.
 out : ndarray, optional
Alternate output array in which to place the result. It must have the same shape as the expected output and its type is preserved (e.g., if it is of type float, then it will remain so, returning 1.0 for True and 0.0 for False, regardless of the type of a). See doc.ufuncs (Section “Output arguments”) for details.
 keepdims : bool, optional
If this is set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.
If the default value is passed, then keepdims will not be passed through to the any method of subclasses of ndarray, however any nondefault value will be. If the subclass’ method does not implement keepdims any exceptions will be raised.
Returns:  any : bool or ndarray
A new boolean or ndarray is returned unless out is specified, in which case a reference to out is returned.
See also
ndarray.any
 equivalent method
all
 Test whether all elements along a given axis evaluate to True.
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.
Examples
>>> np.any([[True, False], [True, True]]) True
>>> np.any([[True, False], [False, False]], axis=0) array([ True, False])
>>> np.any([1, 0, 5]) True
>>> np.any(np.nan) True
>>> o=np.array([False]) >>> z=np.any([1, 4, 5], out=o) >>> z, o (array([ True]), array([ True])) >>> # Check now that z is a reference to o >>> z is o True >>> id(z), id(o) # identity of z and o # doctest: +SKIP (191614240, 191614240)

dask.array.
apply_along_axis
(func1d, axis, arr, *args, dtype=None, shape=None, **kwargs)¶ Apply a function to 1D slices along the given axis.
This docstring was copied from numpy.apply_along_axis.
Some inconsistencies with the Dask version may exist.
Apply a function to 1D slices along the given axis. This is a blocked variant of
numpy.apply_along_axis()
implemented viadask.array.map_blocks()
 func1d : callable
 Function to apply to 1D slices of the array along the given axis
 axis : int
 Axis along which func1d will be applied
 arr : dask array
 Dask array to which
func1d
will be applied  args : any
 Additional arguments to
func1d
.  dtype : str or dtype, optional
 The dtype of the output of
func1d
.  shape : tuple, optional
 The shape of the output of
func1d
.  kwargs : any
 Additional keyword arguments for
func1d
.
Parameters:  func1d : function (M,) > (Nj…)
This function should accept 1D arrays. It is applied to 1D slices of arr along the specified axis.
 axis : integer
Axis along which arr is sliced.
 arr : ndarray (Ni…, M, Nk…)
Input array.
 args : any
Additional arguments to func1d.
 kwargs : any
Additional named arguments to func1d.
New in version 1.9.0.
Returns:  out : ndarray (Ni…, Nj…, Nk…)
The output array. The shape of out is identical to the shape of arr, except along the axis dimension. This axis is removed, and replaced with new dimensions equal to the shape of the return value of func1d. So if func1d returns a scalar out will have one fewer dimensions than arr.
See also
apply_over_axes
 Apply a function repeatedly over multiple axes.
Notes
If either of dtype or shape are not provided, Dask attempts to determine them by calling func1d on a dummy array. This may produce incorrect values for dtype or shape, so we recommend providing them.
Execute func1d(a, *args) where func1d operates on 1D arrays and a is a 1D slice of arr along axis.
This is equivalent to (but faster than) the following use of ndindex and s_, which sets each of
ii
,jj
, andkk
to a tuple of indices:Ni, Nk = a.shape[:axis], a.shape[axis+1:] for ii in ndindex(Ni): for kk in ndindex(Nk): f = func1d(arr[ii + s_[:,] + kk]) Nj = f.shape for jj in ndindex(Nj): out[ii + jj + kk] = f[jj]
Equivalently, eliminating the inner loop, this can be expressed as:
Ni, Nk = a.shape[:axis], a.shape[axis+1:] for ii in ndindex(Ni): for kk in ndindex(Nk): out[ii + s_[...,] + kk] = func1d(arr[ii + s_[:,] + kk])
Examples
>>> def my_func(a): # doctest: +SKIP ... """Average first and last element of a 1D array""" ... return (a[0] + a[1]) * 0.5 >>> b = np.array([[1,2,3], [4,5,6], [7,8,9]]) # doctest: +SKIP >>> np.apply_along_axis(my_func, 0, b) # doctest: +SKIP array([ 4., 5., 6.]) >>> np.apply_along_axis(my_func, 1, b) # doctest: +SKIP array([ 2., 5., 8.])
For a function that returns a 1D array, the number of dimensions in outarr is the same as arr.
>>> b = np.array([[8,1,7], [4,3,9], [5,2,6]]) # doctest: +SKIP >>> np.apply_along_axis(sorted, 1, b) # doctest: +SKIP array([[1, 7, 8], [3, 4, 9], [2, 5, 6]])
For a function that returns a higher dimensional array, those dimensions are inserted in place of the axis dimension.
>>> b = np.array([[1,2,3], [4,5,6], [7,8,9]]) # doctest: +SKIP >>> np.apply_along_axis(np.diag, 1, b) # doctest: +SKIP array([[[1, 0, 0], [0, 2, 0], [0, 0, 3]], [[4, 0, 0], [0, 5, 0], [0, 0, 6]], [[7, 0, 0], [0, 8, 0], [0, 0, 9]]])

dask.array.
apply_over_axes
(func, a, axes)¶ Apply a function repeatedly over multiple axes.
This docstring was copied from numpy.apply_over_axes.
Some inconsistencies with the Dask version may exist.
func is called as res = func(a, axis), where axis is the first element of axes. The result res of the function call must have either the same dimensions as a or one less dimension. If res has one less dimension than a, a dimension is inserted before axis. The call to func is then repeated for each axis in axes, with res as the first argument.
Parameters:  func : function
This function must take two arguments, func(a, axis).
 a : array_like
Input array.
 axes : array_like
Axes over which func is applied; the elements must be integers.
Returns:  apply_over_axis : ndarray
The output array. The number of dimensions is the same as a, but the shape can be different. This depends on whether func changes the shape of its output with respect to its input.
See also
apply_along_axis
 Apply a function to 1D slices of an array along the given axis.
Notes
This function is equivalent to tuple axis arguments to reorderable ufuncs with keepdims=True. Tuple axis arguments to ufuncs have been available since version 1.7.0.
Examples
>>> a = np.arange(24).reshape(2,3,4) # doctest: +SKIP >>> a # doctest: +SKIP array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]], [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])
Sum over axes 0 and 2. The result has same number of dimensions as the original array:
>>> np.apply_over_axes(np.sum, a, [0,2]) # doctest: +SKIP array([[[ 60], [ 92], [124]]])
Tuple axis arguments to ufuncs are equivalent:
>>> np.sum(a, axis=(0,2), keepdims=True) # doctest: +SKIP array([[[ 60], [ 92], [124]]])

dask.array.
arange
(*args, **kwargs)¶ Return evenly spaced values from start to stop with step size step.
The values are halfopen [start, stop), so including start and excluding stop. This is basically the same as python’s range function but for dask arrays.
When using a noninteger step, such as 0.1, the results will often not be consistent. It is better to use linspace for these cases.
Parameters:  start : int, optional
The starting value of the sequence. The default is 0.
 stop : int
The end of the interval, this value is excluded from the interval.
 step : int, optional
The spacing between the values. The default is 1 when not specified. The last value of the sequence.
 chunks : int
The number of samples on each block. Note that the last block will have fewer samples if
len(array) % chunks != 0
. dtype : numpy.dtype
Output dtype. Omit to infer it from start, stop, step
Returns:  samples : dask array
See also

dask.array.
arccos
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Trigonometric inverse cosine, elementwise.
The inverse of cos so that, if
y = cos(x)
, thenx = arccos(y)
.Parameters:  x : array_like
xcoordinate on the unit circle. For real arguments, the domain is [1, 1].
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  angle : ndarray
The angle of the ray intersecting the unit circle at the given xcoordinate in radians [0, pi]. This is a scalar if x is a scalar.
Notes
arccos is a multivalued function: for each x there are infinitely many numbers z such that cos(z) = x. The convention is to return the angle z whose real part lies in [0, pi].
For realvalued input data types, arccos always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arccos is a complex analytic function that has branch cuts [inf, 1] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse cos is also known as acos or cos^1.
References
M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 79. http://www.math.sfu.ca/~cbm/aands/
Examples
We expect the arccos of 1 to be 0, and of 1 to be pi:
>>> np.arccos([1, 1]) # doctest: +SKIP array([ 0. , 3.14159265])
Plot arccos:
>>> import matplotlib.pyplot as plt # doctest: +SKIP >>> x = np.linspace(1, 1, num=100) # doctest: +SKIP >>> plt.plot(x, np.arccos(x)) # doctest: +SKIP >>> plt.axis('tight') # doctest: +SKIP >>> plt.show() # doctest: +SKIP

dask.array.
arccosh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse hyperbolic cosine, elementwise.
Parameters:  x : array_like
Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  arccosh : ndarray
Array of the same shape as x. This is a scalar if x is a scalar.
Notes
arccosh is a multivalued function: for each x there are infinitely many numbers z such that cosh(z) = x. The convention is to return the z whose imaginary part lies in [pi, pi] and the real part in
[0, inf]
.For realvalued input data types, arccosh always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arccosh is a complex analytical function that has a branch cut [inf, 1] and is continuous from above on it.
References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Inverse hyperbolic function”, https://en.wikipedia.org/wiki/Arccosh Examples
>>> np.arccosh([np.e, 10.0]) # doctest: +SKIP array([ 1.65745445, 2.99322285]) >>> np.arccosh(1) # doctest: +SKIP 0.0

dask.array.
arcsin
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse sine, elementwise.
Parameters:  x : array_like
ycoordinate on the unit circle.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  angle : ndarray
The inverse sine of each element in x, in radians and in the closed interval
[pi/2, pi/2]
. This is a scalar if x is a scalar.
Notes
arcsin is a multivalued function: for each x there are infinitely many numbers z such that \(sin(z) = x\). The convention is to return the angle z whose real part lies in [pi/2, pi/2].
For realvalued input data types, arcsin always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arcsin is a complex analytic function that has, by convention, the branch cuts [inf, 1] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse sine is also known as asin or sin^{1}.
References
Abramowitz, M. and Stegun, I. A., Handbook of Mathematical Functions, 10th printing, New York: Dover, 1964, pp. 79ff. http://www.math.sfu.ca/~cbm/aands/
Examples
>>> np.arcsin(1) # pi/2 # doctest: +SKIP 1.5707963267948966 >>> np.arcsin(1) # pi/2 # doctest: +SKIP 1.5707963267948966 >>> np.arcsin(0) # doctest: +SKIP 0.0

dask.array.
arcsinh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse hyperbolic sine elementwise.
Parameters:  x : array_like
Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Array of the same shape as x. This is a scalar if x is a scalar.
Notes
arcsinh is a multivalued function: for each x there are infinitely many numbers z such that sinh(z) = x. The convention is to return the z whose imaginary part lies in [pi/2, pi/2].
For realvalued input data types, arcsinh always returns real output. For each value that cannot be expressed as a real number or infinity, it returns
nan
and sets the invalid floating point error flag.For complexvalued input, arccos is a complex analytical function that has branch cuts [1j, infj] and [1j, infj] and is continuous from the right on the former and from the left on the latter.
The inverse hyperbolic sine is also known as asinh or
sinh^1
.References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Inverse hyperbolic function”, https://en.wikipedia.org/wiki/Arcsinh Examples
>>> np.arcsinh(np.array([np.e, 10.0])) # doctest: +SKIP array([ 1.72538256, 2.99822295])

dask.array.
arctan
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Trigonometric inverse tangent, elementwise.
The inverse of tan, so that if
y = tan(x)
thenx = arctan(y)
.Parameters:  x : array_like
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Out has the same shape as x. Its real part is in
[pi/2, pi/2]
(arctan(+/inf)
returns+/pi/2
). This is a scalar if x is a scalar.
See also
Notes
arctan is a multivalued function: for each x there are infinitely many numbers z such that tan(z) = x. The convention is to return the angle z whose real part lies in [pi/2, pi/2].
For realvalued input data types, arctan always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arctan is a complex analytic function that has [1j, infj] and [1j, infj] as branch cuts, and is continuous from the left on the former and from the right on the latter.
The inverse tangent is also known as atan or tan^{1}.
References
Abramowitz, M. and Stegun, I. A., Handbook of Mathematical Functions, 10th printing, New York: Dover, 1964, pp. 79. http://www.math.sfu.ca/~cbm/aands/
Examples
We expect the arctan of 0 to be 0, and of 1 to be pi/4:
>>> np.arctan([0, 1]) # doctest: +SKIP array([ 0. , 0.78539816])
>>> np.pi/4 # doctest: +SKIP 0.78539816339744828
Plot arctan:
>>> import matplotlib.pyplot as plt # doctest: +SKIP >>> x = np.linspace(10, 10) # doctest: +SKIP >>> plt.plot(x, np.arctan(x)) # doctest: +SKIP >>> plt.axis('tight') # doctest: +SKIP >>> plt.show() # doctest: +SKIP

dask.array.
arctan2
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Elementwise arc tangent of
x1/x2
choosing the quadrant correctly.The quadrant (i.e., branch) is chosen so that
arctan2(x1, x2)
is the signed angle in radians between the ray ending at the origin and passing through the point (1,0), and the ray ending at the origin and passing through the point (x2, x1). (Note the role reversal: the “ycoordinate” is the first function parameter, the “xcoordinate” is the second.) By IEEE convention, this function is defined for x2 = +/0 and for either or both of x1 and x2 = +/inf (see Notes for specific values).This function is not defined for complexvalued arguments; for the socalled argument of complex values, use angle.
Parameters:  x1 : array_like, realvalued
ycoordinates.
 x2 : array_like, realvalued
xcoordinates. x2 must be broadcastable to match the shape of x1 or vice versa.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  angle : ndarray
Array of angles in radians, in the range
[pi, pi]
. This is a scalar if both x1 and x2 are scalars.
Notes
arctan2 is identical to the atan2 function of the underlying C library. The following special values are defined in the C standard: [1]
x1 x2 arctan2(x1,x2) +/ 0 +0 +/ 0 +/ 0 0 +/ pi > 0 +/inf +0 / +pi < 0 +/inf 0 / pi +/inf +inf +/ (pi/4) +/inf inf +/ (3*pi/4) Note that +0 and 0 are distinct floating point numbers, as are +inf and inf.
References
[1] (1, 2) ISO/IEC standard 9899:1999, “Programming language C.” Examples
Consider four points in different quadrants:
>>> x = np.array([1, +1, +1, 1]) # doctest: +SKIP >>> y = np.array([1, 1, +1, +1]) # doctest: +SKIP >>> np.arctan2(y, x) * 180 / np.pi # doctest: +SKIP array([135., 45., 45., 135.])
Note the order of the parameters. arctan2 is defined also when x2 = 0 and at several other special points, obtaining values in the range
[pi, pi]
:>>> np.arctan2([1., 1.], [0., 0.]) # doctest: +SKIP array([ 1.57079633, 1.57079633]) >>> np.arctan2([0., 0., np.inf], [+0., 0., np.inf]) # doctest: +SKIP array([ 0. , 3.14159265, 0.78539816])

dask.array.
arctanh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Inverse hyperbolic tangent elementwise.
Parameters:  x : array_like
Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Array of the same shape as x. This is a scalar if x is a scalar.
See also
emath.arctanh
Notes
arctanh is a multivalued function: for each x there are infinitely many numbers z such that tanh(z) = x. The convention is to return the z whose imaginary part lies in [pi/2, pi/2].
For realvalued input data types, arctanh always returns real output. For each value that cannot be expressed as a real number or infinity, it yields
nan
and sets the invalid floating point error flag.For complexvalued input, arctanh is a complex analytical function that has branch cuts [1, inf] and [1, inf] and is continuous from above on the former and from below on the latter.
The inverse hyperbolic tangent is also known as atanh or
tanh^1
.References
[1] M. Abramowitz and I.A. Stegun, “Handbook of Mathematical Functions”, 10th printing, 1964, pp. 86. http://www.math.sfu.ca/~cbm/aands/ [2] Wikipedia, “Inverse hyperbolic function”, https://en.wikipedia.org/wiki/Arctanh Examples
>>> np.arctanh([0, 0.5]) # doctest: +SKIP array([ 0. , 0.54930614])

dask.array.
argmax
(a, axis=None, out=None)¶ Returns the indices of the maximum values along an axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
By default, the index is into the flattened array, otherwise along the specified axis.
 out : array, optional
If provided, the result will be inserted into this array. It should be of the appropriate shape and dtype.
Returns:  index_array : ndarray of ints
Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
See also
ndarray.argmax
,argmin
amax
 The maximum value along a given axis.
unravel_index
 Convert a flat index into an index tuple.
Notes
In case of multiple occurrences of the maximum values, the indices corresponding to the first occurrence are returned.
Examples
>>> a = np.arange(6).reshape(2,3) + 10 >>> a array([[10, 11, 12], [13, 14, 15]]) >>> np.argmax(a) 5 >>> np.argmax(a, axis=0) array([1, 1, 1]) >>> np.argmax(a, axis=1) array([2, 2])
Indexes of the maximal elements of a Ndimensional array:
>>> ind = np.unravel_index(np.argmax(a, axis=None), a.shape) >>> ind (1, 2) >>> a[ind] 15
>>> b = np.arange(6) >>> b[1] = 5 >>> b array([0, 5, 2, 3, 4, 5]) >>> np.argmax(b) # Only the first occurrence is returned. 1

dask.array.
argmin
(a, axis=None, out=None)¶ Returns the indices of the minimum values along an axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
By default, the index is into the flattened array, otherwise along the specified axis.
 out : array, optional
If provided, the result will be inserted into this array. It should be of the appropriate shape and dtype.
Returns:  index_array : ndarray of ints
Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
See also
ndarray.argmin
,argmax
amin
 The minimum value along a given axis.
unravel_index
 Convert a flat index into an index tuple.
Notes
In case of multiple occurrences of the minimum values, the indices corresponding to the first occurrence are returned.
Examples
>>> a = np.arange(6).reshape(2,3) + 10 >>> a array([[10, 11, 12], [13, 14, 15]]) >>> np.argmin(a) 0 >>> np.argmin(a, axis=0) array([0, 0, 0]) >>> np.argmin(a, axis=1) array([0, 0])
Indices of the minimum elements of a Ndimensional array:
>>> ind = np.unravel_index(np.argmin(a, axis=None), a.shape) >>> ind (0, 0) >>> a[ind] 10
>>> b = np.arange(6) + 10 >>> b[4] = 10 >>> b array([10, 11, 12, 13, 10, 15]) >>> np.argmin(b) # Only the first occurrence is returned. 0

dask.array.
argtopk
(a, k, axis=1, split_every=None)¶ Extract the indices of the k largest elements from a on the given axis, and return them sorted from largest to smallest. If k is negative, extract the indices of the k smallest elements instead, and return them sorted from smallest to largest.
This performs best when
k
is much smaller than the chunk size. All results will be returned in a single chunk along the given axis.Parameters:  x: Array
Data being sorted
 k: int
 axis: int, optional
 split_every: int >=2, optional
See
topk()
. The performance considerations for topk also apply here.
Returns:  Selection of np.intp indices of x with size abs(k) along the given axis.
Examples
>>> import dask.array as da >>> x = np.array([5, 1, 3, 6]) >>> d = da.from_array(x, chunks=2) >>> d.argtopk(2).compute() array([3, 0]) >>> d.argtopk(2).compute() array([1, 2])

dask.array.
argwhere
(a)¶ Find the indices of array elements that are nonzero, grouped by element.
This docstring was copied from numpy.argwhere.
Some inconsistencies with the Dask version may exist.
Parameters:  a : array_like
Input data.
Returns:  index_array : ndarray
Indices of elements that are nonzero. Indices are grouped by element.
Notes
np.argwhere(a)
is the same asnp.transpose(np.nonzero(a))
.The output of
argwhere
is not suitable for indexing arrays. For this purpose usenonzero(a)
instead.Examples
>>> x = np.arange(6).reshape(2,3) # doctest: +SKIP >>> x # doctest: +SKIP array([[0, 1, 2], [3, 4, 5]]) >>> np.argwhere(x>1) # doctest: +SKIP array([[0, 2], [1, 0], [1, 1], [1, 2]])

dask.array.
around
(x, decimals=0)¶ Evenly round to the given number of decimals.
This docstring was copied from numpy.around.
Some inconsistencies with the Dask version may exist.
Parameters:  a : array_like (Not supported in Dask)
Input data.
 decimals : int, optional
Number of decimal places to round to (default: 0). If decimals is negative, it specifies the number of positions to the left of the decimal point.
 out : ndarray, optional (Not supported in Dask)
Alternative output array in which to place the result. It must have the same shape as the expected output, but the type of the output values will be cast if necessary. See doc.ufuncs (Section “Output arguments”) for details.
Returns:  rounded_array : ndarray
An array of the same type as a, containing the rounded values. Unless out was specified, a new array is created. A reference to the result is returned.
The real and imaginary parts of complex numbers are rounded separately. The result of rounding a float is a float.
Notes
For values exactly halfway between rounded decimal values, NumPy rounds to the nearest even value. Thus 1.5 and 2.5 round to 2.0, 0.5 and 0.5 round to 0.0, etc. Results may also be surprising due to the inexact representation of decimal fractions in the IEEE floating point standard [1] and errors introduced when scaling by powers of ten.
References
[1] (1, 2) “Lecture Notes on the Status of IEEE 754”, William Kahan, https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF [2] “How Futile are Mindless Assessments of Roundoff in FloatingPoint Computation?”, William Kahan, https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf Examples
>>> np.around([0.37, 1.64]) # doctest: +SKIP array([ 0., 2.]) >>> np.around([0.37, 1.64], decimals=1) # doctest: +SKIP array([ 0.4, 1.6]) >>> np.around([.5, 1.5, 2.5, 3.5, 4.5]) # rounds to nearest even value # doctest: +SKIP array([ 0., 2., 2., 4., 4.]) >>> np.around([1,2,3,11], decimals=1) # ndarray of ints is returned # doctest: +SKIP array([ 1, 2, 3, 11]) >>> np.around([1,2,3,11], decimals=1) # doctest: +SKIP array([ 0, 0, 0, 10])

dask.array.
array
(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)¶ This docstring was copied from numpy.array.
Some inconsistencies with the Dask version may exist.
Create an array.
Parameters:  object : array_like
An array, any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence.
 dtype : datatype, optional
The desired datatype for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. This argument can only be used to ‘upcast’ the array. For downcasting, use the .astype(t) method.
 copy : bool, optional
If true (default), then the object is copied. Otherwise, a copy will only be made if __array__ returns a copy, if obj is a nested sequence, or if a copy is needed to satisfy any of the other requirements (dtype, order, etc.).
 order : {‘K’, ‘A’, ‘C’, ‘F’}, optional
Specify the memory layout of the array. If object is not an array, the newly created array will be in C order (row major) unless ‘F’ is specified, in which case it will be in Fortran order (column major). If object is an array the following holds.
order no copy copy=True ‘K’ unchanged F & C order preserved, otherwise most similar order ‘A’ unchanged F order if input is F and not C, otherwise C order ‘C’ C order C order ‘F’ F order F order When
copy=False
and a copy is made for other reasons, the result is the same as ifcopy=True
, with some exceptions for A, see the Notes section. The default order is ‘K’. subok : bool, optional
If True, then subclasses will be passedthrough, otherwise the returned array will be forced to be a baseclass array (default).
 ndmin : int, optional
Specifies the minimum number of dimensions that the resulting array should have. Ones will be prepended to the shape as needed to meet this requirement.
Returns:  out : ndarray
An array object satisfying the specified requirements.
See also
empty_like
 Return an empty array with shape and type of input.
ones_like
 Return an array of ones with shape and type of input.
zeros_like
 Return an array of zeros with shape and type of input.
full_like
 Return a new array with shape of input filled with value.
empty
 Return a new uninitialized array.
ones
 Return a new array setting values to one.
zeros
 Return a new array setting values to zero.
full
 Return a new array of given shape filled with value.
Notes
When order is ‘A’ and object is an array in neither ‘C’ nor ‘F’ order, and a copy is forced by a change in dtype, then the order of the result is not necessarily ‘C’ as expected. This is likely a bug.
Examples
>>> np.array([1, 2, 3]) # doctest: +SKIP array([1, 2, 3])
Upcasting:
>>> np.array([1, 2, 3.0]) # doctest: +SKIP array([ 1., 2., 3.])
More than one dimension:
>>> np.array([[1, 2], [3, 4]]) # doctest: +SKIP array([[1, 2], [3, 4]])
Minimum dimensions 2:
>>> np.array([1, 2, 3], ndmin=2) # doctest: +SKIP array([[1, 2, 3]])
Type provided:
>>> np.array([1, 2, 3], dtype=complex) # doctest: +SKIP array([ 1.+0.j, 2.+0.j, 3.+0.j])
Datatype consisting of more than one element:
>>> x = np.array([(1,2),(3,4)],dtype=[('a','<i4'),('b','<i4')]) # doctest: +SKIP >>> x['a'] # doctest: +SKIP array([1, 3])
Creating an array from subclasses:
>>> np.array(np.mat('1 2; 3 4')) # doctest: +SKIP array([[1, 2], [3, 4]])
>>> np.array(np.mat('1 2; 3 4'), subok=True) # doctest: +SKIP matrix([[1, 2], [3, 4]])

dask.array.
asanyarray
(a)¶ Convert the input to a dask array.
Subclasses of
np.ndarray
will be passed through as chunks unchanged.Parameters:  a : arraylike
Input data, in any form that can be converted to a dask array.
Returns:  out : dask array
Dask array interpretation of a.
Examples
>>> import dask.array as da >>> import numpy as np >>> x = np.arange(3) >>> da.asanyarray(x) dask.array<array, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
>>> y = [[1, 2, 3], [4, 5, 6]] >>> da.asanyarray(y) dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 3), chunktype=numpy.ndarray>

dask.array.
asarray
(a, **kwargs)¶ Convert the input to a dask array.
Parameters:  a : arraylike
Input data, in any form that can be converted to a dask array.
Returns:  out : dask array
Dask array interpretation of a.
Examples
>>> import dask.array as da >>> import numpy as np >>> x = np.arange(3) >>> da.asarray(x) dask.array<array, shape=(3,), dtype=int64, chunksize=(3,), chunktype=numpy.ndarray>
>>> y = [[1, 2, 3], [4, 5, 6]] >>> da.asarray(y) dask.array<array, shape=(2, 3), dtype=int64, chunksize=(2, 3), chunktype=numpy.ndarray>

dask.array.
atleast_1d
(*arys)¶ Convert inputs to arrays with at least one dimension.
This docstring was copied from numpy.atleast_1d.
Some inconsistencies with the Dask version may exist.
Scalar inputs are converted to 1dimensional arrays, whilst higherdimensional inputs are preserved.
Parameters:  arys1, arys2, … : array_like
One or more input arrays.
Returns:  ret : ndarray
An array, or list of arrays, each with
a.ndim >= 1
. Copies are made only if necessary.
See also
Examples
>>> np.atleast_1d(1.0) # doctest: +SKIP array([ 1.])
>>> x = np.arange(9.0).reshape(3,3) # doctest: +SKIP >>> np.atleast_1d(x) # doctest: +SKIP array([[ 0., 1., 2.], [ 3., 4., 5.], [ 6., 7., 8.]]) >>> np.atleast_1d(x) is x # doctest: +SKIP True
>>> np.atleast_1d(1, [3, 4]) # doctest: +SKIP [array([1]), array([3, 4])]

dask.array.
atleast_2d
(*arys)¶ View inputs as arrays with at least two dimensions.
This docstring was copied from numpy.atleast_2d.
Some inconsistencies with the Dask version may exist.
Parameters:  arys1, arys2, … : array_like
One or more arraylike sequences. Nonarray inputs are converted to arrays. Arrays that already have two or more dimensions are preserved.
Returns:  res, res2, … : ndarray
An array, or list of arrays, each with
a.ndim >= 2
. Copies are avoided where possible, and views with two or more dimensions are returned.
See also
Examples
>>> np.atleast_2d(3.0) # doctest: +SKIP array([[ 3.]])
>>> x = np.arange(3.0) # doctest: +SKIP >>> np.atleast_2d(x) # doctest: +SKIP array([[ 0., 1., 2.]]) >>> np.atleast_2d(x).base is x # doctest: +SKIP True
>>> np.atleast_2d(1, [1, 2], [[1, 2]]) # doctest: +SKIP [array([[1]]), array([[1, 2]]), array([[1, 2]])]

dask.array.
atleast_3d
(*arys)¶ View inputs as arrays with at least three dimensions.
This docstring was copied from numpy.atleast_3d.
Some inconsistencies with the Dask version may exist.
Parameters:  arys1, arys2, … : array_like
One or more arraylike sequences. Nonarray inputs are converted to arrays. Arrays that already have three or more dimensions are preserved.
Returns:  res1, res2, … : ndarray
An array, or list of arrays, each with
a.ndim >= 3
. Copies are avoided where possible, and views with three or more dimensions are returned. For example, a 1D array of shape(N,)
becomes a view of shape(1, N, 1)
, and a 2D array of shape(M, N)
becomes a view of shape(M, N, 1)
.
See also
Examples
>>> np.atleast_3d(3.0) # doctest: +SKIP array([[[ 3.]]])
>>> x = np.arange(3.0) # doctest: +SKIP >>> np.atleast_3d(x).shape # doctest: +SKIP (1, 3, 1)
>>> x = np.arange(12.0).reshape(4,3) # doctest: +SKIP >>> np.atleast_3d(x).shape # doctest: +SKIP (4, 3, 1) >>> np.atleast_3d(x).base is x.base # x is a reshape, so not base itself # doctest: +SKIP True
>>> for arr in np.atleast_3d([1, 2], [[1, 2]], [[[1, 2]]]): # doctest: +SKIP ... print(arr, arr.shape) ... [[[1] [2]]] (1, 2, 1) [[[1] [2]]] (1, 2, 1) [[[1 2]]] (1, 1, 2)

dask.array.
average
(a, axis=None, weights=None, returned=False)¶ Compute the weighted average along the specified axis.
This docstring was copied from numpy.average.
Some inconsistencies with the Dask version may exist.
Parameters:  a : array_like
Array containing data to be averaged. If a is not an array, a conversion is attempted.
 axis : None or int or tuple of ints, optional
Axis or axes along which to average a. The default, axis=None, will average over all of the elements of the input array. If axis is negative it counts from the last to the first axis.
New in version 1.7.0.
If axis is a tuple of ints, averaging is performed on all of the axes specified in the tuple instead of a single axis or all the axes as before.
 weights : array_like, optional
An array of weights associated with the values in a. Each value in a contributes to the average according to its associated weight. The weights array can either be 1D (in which case its length must be the size of a along the given axis) or of the same shape as a. If weights=None, then all data in a are assumed to have a weight equal to one.
 returned : bool, optional
Default is False. If True, the tuple (average, sum_of_weights) is returned, otherwise only the average is returned. If weights=None, sum_of_weights is equivalent to the number of elements over which the average is taken.
Returns:  retval, [sum_of_weights] : array_type or double
Return the average along the specified axis. When returned is True, return a tuple with the average as the first element and the sum of the weights as the second element. sum_of_weights is of the same type as retval. The result dtype follows a genereal pattern. If weights is None, the result dtype will be that of a , or
float64
if a is integral. Otherwise, if weights is not None and a is non integral, the result type will be the type of lowest precision capable of representing values of both a and weights. If a happens to be integral, the previous rules still applies but the result dtype will at least befloat64
.
Raises:  ZeroDivisionError
When all weights along axis are zero. See numpy.ma.average for a version robust to this type of error.
 TypeError
When the length of 1D weights is not the same as the shape of a along axis.
See also
ma.average
 average for masked arrays – useful if your data contains “missing” values
numpy.result_type
 Returns the type that results from applying the numpy type promotion rules to the arguments.
Examples
>>> data = range(1,5) # doctest: +SKIP >>> data # doctest: +SKIP [1, 2, 3, 4] >>> np.average(data) # doctest: +SKIP 2.5 >>> np.average(range(1,11), weights=range(10,0,1)) # doctest: +SKIP 4.0
>>> data = np.arange(6).reshape((3,2)) # doctest: +SKIP >>> data # doctest: +SKIP array([[0, 1], [2, 3], [4, 5]]) >>> np.average(data, axis=1, weights=[1./4, 3./4]) # doctest: +SKIP array([ 0.75, 2.75, 4.75]) >>> np.average(data, weights=[1./4, 3./4]) # doctest: +SKIP
Traceback (most recent call last): … TypeError: Axis must be specified when shapes of a and weights differ.
>>> a = np.ones(5, dtype=np.float128) # doctest: +SKIP >>> w = np.ones(5, dtype=np.complex64) # doctest: +SKIP >>> avg = np.average(a, weights=w) # doctest: +SKIP >>> print(avg.dtype) # doctest: +SKIP complex256

dask.array.
bincount
(x, weights=None, minlength=0)¶ This docstring was copied from numpy.bincount.
Some inconsistencies with the Dask version may exist.
Count number of occurrences of each value in array of nonnegative ints.
The number of bins (of size 1) is one larger than the largest value in x. If minlength is specified, there will be at least this number of bins in the output array (though it will be longer if necessary, depending on the contents of x). Each bin gives the number of occurrences of its index value in x. If weights is specified the input array is weighted by it, i.e. if a value
n
is found at positioni
,out[n] += weight[i]
instead ofout[n] += 1
.Parameters:  x : array_like, 1 dimension, nonnegative ints
Input array.
 weights : array_like, optional
Weights, array of the same shape as x.
 minlength : int, optional
A minimum number of bins for the output array.
New in version 1.6.0.
Returns:  out : ndarray of ints
The result of binning the input array. The length of out is equal to
np.amax(x)+1
.
Raises:  ValueError
If the input is not 1dimensional, or contains elements with negative values, or if minlength is negative.
 TypeError
If the type of the input is float or complex.
Examples
>>> np.bincount(np.arange(5)) # doctest: +SKIP array([1, 1, 1, 1, 1]) >>> np.bincount(np.array([0, 1, 1, 3, 2, 1, 7])) # doctest: +SKIP array([1, 3, 1, 1, 0, 0, 0, 1])
>>> x = np.array([0, 1, 1, 3, 2, 1, 7, 23]) # doctest: +SKIP >>> np.bincount(x).size == np.amax(x)+1 # doctest: +SKIP True
The input array needs to be of integer dtype, otherwise a TypeError is raised:
>>> np.bincount(np.arange(5, dtype=float)) # doctest: +SKIP Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: array cannot be safely cast to required type
A possible use of
bincount
is to perform sums over variablesize chunks of an array, using theweights
keyword.>>> w = np.array([0.3, 0.5, 0.2, 0.7, 1., 0.6]) # weights # doctest: +SKIP >>> x = np.array([0, 1, 1, 2, 2, 2]) # doctest: +SKIP >>> np.bincount(x, weights=w) # doctest: +SKIP array([ 0.3, 0.7, 1.1])

dask.array.
bitwise_and
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Compute the bitwise AND of two arrays elementwise.
Computes the bitwise AND of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator
&
.Parameters:  x1, x2 : array_like
Only integer and boolean types are handled.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Result. This is a scalar if both x1 and x2 are scalars.
See also
logical_and
,bitwise_or
,bitwise_xor
binary_repr
 Return the binary representation of the input number as a string.
Examples
The number 13 is represented by
00001101
. Likewise, 17 is represented by00010001
. The bitwise AND of 13 and 17 is therefore000000001
, or 1:>>> np.bitwise_and(13, 17) # doctest: +SKIP 1
>>> np.bitwise_and(14, 13) # doctest: +SKIP 12 >>> np.binary_repr(12) # doctest: +SKIP '1100' >>> np.bitwise_and([14,3], 13) # doctest: +SKIP array([12, 1])
>>> np.bitwise_and([11,7], [4,25]) # doctest: +SKIP array([0, 1]) >>> np.bitwise_and(np.array([2,5,255]), np.array([3,14,16])) # doctest: +SKIP array([ 2, 4, 16]) >>> np.bitwise_and([True, True], [False, True]) # doctest: +SKIP array([False, True])

dask.array.
bitwise_not
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Compute bitwise inversion, or bitwise NOT, elementwise.
Computes the bitwise NOT of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator
~
.For signed integer inputs, the two’s complement is returned. In a two’scomplement system negative numbers are represented by the two’s complement of the absolute value. This is the most common method of representing signed integers on computers [1]. A Nbit two’scomplement system can represent every integer in the range \(2^{N1}\) to \(+2^{N1}1\).
Parameters:  x : array_like
Only integer and boolean types are handled.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Result. This is a scalar if x is a scalar.
See also
bitwise_and
,bitwise_or
,bitwise_xor
,logical_not
binary_repr
 Return the binary representation of the input number as a string.
Notes
bitwise_not is an alias for invert:
>>> np.bitwise_not is np.invert # doctest: +SKIP True
References
[1] (1, 2) Wikipedia, “Two’s complement”, https://en.wikipedia.org/wiki/Two’s_complement Examples
We’ve seen that 13 is represented by
00001101
. The invert or bitwise NOT of 13 is then:>>> np.invert(np.array([13], dtype=uint8)) # doctest: +SKIP array([242], dtype=uint8) >>> np.binary_repr(x, width=8) # doctest: +SKIP '00001101' >>> np.binary_repr(242, width=8) # doctest: +SKIP '11110010'
The result depends on the bitwidth:
>>> np.invert(np.array([13], dtype=uint16)) # doctest: +SKIP array([65522], dtype=uint16) >>> np.binary_repr(x, width=16) # doctest: +SKIP '0000000000001101' >>> np.binary_repr(65522, width=16) # doctest: +SKIP '1111111111110010'
When using signed integer types the result is the two’s complement of the result for the unsigned type:
>>> np.invert(np.array([13], dtype=int8)) # doctest: +SKIP array([14], dtype=int8) >>> np.binary_repr(14, width=8) # doctest: +SKIP '11110010'
Booleans are accepted as well:
>>> np.invert(array([True, False])) # doctest: +SKIP array([False, True])

dask.array.
bitwise_or
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Compute the bitwise OR of two arrays elementwise.
Computes the bitwise OR of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator

.Parameters:  x1, x2 : array_like
Only integer and boolean types are handled.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Result. This is a scalar if both x1 and x2 are scalars.
See also
logical_or
,bitwise_and
,bitwise_xor
binary_repr
 Return the binary representation of the input number as a string.
Examples
The number 13 has the binaray representation
00001101
. Likewise, 16 is represented by00010000
. The bitwise OR of 13 and 16 is then000111011
, or 29:>>> np.bitwise_or(13, 16) # doctest: +SKIP 29 >>> np.binary_repr(29) # doctest: +SKIP '11101'
>>> np.bitwise_or(32, 2) # doctest: +SKIP 34 >>> np.bitwise_or([33, 4], 1) # doctest: +SKIP array([33, 5]) >>> np.bitwise_or([33, 4], [1, 2]) # doctest: +SKIP array([33, 6])
>>> np.bitwise_or(np.array([2, 5, 255]), np.array([4, 4, 4])) # doctest: +SKIP array([ 6, 5, 255]) >>> np.array([2, 5, 255])  np.array([4, 4, 4]) # doctest: +SKIP array([ 6, 5, 255]) >>> np.bitwise_or(np.array([2, 5, 255, 2147483647L], dtype=np.int32), # doctest: +SKIP ... np.array([4, 4, 4, 2147483647L], dtype=np.int32)) array([ 6, 5, 255, 2147483647]) >>> np.bitwise_or([True, True], [False, True]) # doctest: +SKIP array([ True, True])

dask.array.
bitwise_xor
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Compute the bitwise XOR of two arrays elementwise.
Computes the bitwise XOR of the underlying binary representation of the integers in the input arrays. This ufunc implements the C/Python operator
^
.Parameters:  x1, x2 : array_like
Only integer and boolean types are handled.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Result. This is a scalar if both x1 and x2 are scalars.
See also
logical_xor
,bitwise_and
,bitwise_or
binary_repr
 Return the binary representation of the input number as a string.
Examples
The number 13 is represented by
00001101
. Likewise, 17 is represented by00010001
. The bitwise XOR of 13 and 17 is therefore00011100
, or 28:>>> np.bitwise_xor(13, 17) # doctest: +SKIP 28 >>> np.binary_repr(28) # doctest: +SKIP '11100'
>>> np.bitwise_xor(31, 5) # doctest: +SKIP 26 >>> np.bitwise_xor([31,3], 5) # doctest: +SKIP array([26, 6])
>>> np.bitwise_xor([31,3], [5,6]) # doctest: +SKIP array([26, 5]) >>> np.bitwise_xor([True, True], [False, True]) # doctest: +SKIP array([ True, False])

dask.array.
block
(arrays, allow_unknown_chunksizes=False)¶ Assemble an ndarray from nested lists of blocks.
Blocks in the innermost lists are concatenated along the last dimension (1), then these are concatenated along the secondlast dimension (2), and so on until the outermost list is reached
Blocks can be of any dimension, but will not be broadcasted using the normal rules. Instead, leading axes of size 1 are inserted, to make
block.ndim
the same for all blocks. This is primarily useful for working with scalars, and means that code likeblock([v, 1])
is valid, wherev.ndim == 1
.When the nested list is two levels deep, this allows block matrices to be constructed from their components.
Parameters:  arrays : nested list of array_like or scalars (but not tuples)
If passed a single ndarray or scalar (a nested list of depth 0), this is returned unmodified (and not copied).
Elements shapes must match along the appropriate axes (without broadcasting), but leading 1s will be prepended to the shape as necessary to make the dimensions match.
 allow_unknown_chunksizes: bool
Allow unknown chunksizes, such as come from converting from dask dataframes. Dask.array is unable to verify that chunks line up. If data comes from differently aligned sources then this can cause unexpected results.
Returns:  block_array : ndarray
The array assembled from the given blocks.
The dimensionality of the output is equal to the greatest of: * the dimensionality of all the inputs * the depth to which the input list is nested
Raises:  ValueError
 If list depths are mismatched  for instance,
[[a, b], c]
is illegal, and should be spelt[[a, b], [c]]
 If lists are empty  for instance,
[[a, b], []]
 If list depths are mismatched  for instance,
See also
concatenate
 Join a sequence of arrays together.
stack
 Stack arrays in sequence along a new dimension.
hstack
 Stack arrays in sequence horizontally (column wise).
vstack
 Stack arrays in sequence vertically (row wise).
dstack
 Stack arrays in sequence depth wise (along third dimension).
vsplit
 Split array into a list of multiple subarrays vertically.
Notes
When called with only scalars,
block
is equivalent to an ndarray call. Soblock([[1, 2], [3, 4]])
is equivalent toarray([[1, 2], [3, 4]])
.This function does not enforce that the blocks lie on a fixed grid.
block([[a, b], [c, d]])
is not restricted to arrays of the form:AAAbb AAAbb cccDD
But is also allowed to produce, for some
a, b, c, d
:AAAbb AAAbb cDDDD
Since concatenation happens along the last axis first, block is _not_ capable of producing the following directly:
AAAbb cccbb cccDD
Matlab’s “square bracket stacking”,
[A, B, ...; p, q, ...]
, is equivalent toblock([[A, B, ...], [p, q, ...]])
.

dask.array.
blockwise
(func, out_ind, *args, name=None, token=None, dtype=None, adjust_chunks=None, new_axes=None, align_arrays=True, concatenate=None, meta=None, **kwargs)¶ Tensor operation: Generalized inner and outer products
A broad class of blocked algorithms and patterns can be specified with a concise multiindex notation. The
blockwise
function applies an inmemory function across multiple blocks of multiple inputs in a variety of ways. Many dask.array operations are special cases of blockwise including elementwise, broadcasting, reductions, tensordot, and transpose.Parameters:  func : callable
Function to apply to individual tuples of blocks
 out_ind : iterable
Block pattern of the output, something like ‘ijk’ or (1, 2, 3)
 *args : sequence of Array, index pairs
Sequence like (x, ‘ij’, y, ‘jk’, z, ‘i’)
 **kwargs : dict
Extra keyword arguments to pass to function
 dtype : np.dtype
Datatype of resulting array.
 concatenate : bool, keyword only
If true concatenate arrays along dummy indices, else provide lists
 adjust_chunks : dict
Dictionary mapping index to function to be applied to chunk sizes
 new_axes : dict, keyword only
New indexes and their dimension lengths
Examples
2D embarrassingly parallel operation from two arrays, x, and y.
>>> z = blockwise(operator.add, 'ij', x, 'ij', y, 'ij', dtype='f8') # z = x + y # doctest: +SKIP
Outer product multiplying x by y, two 1d vectors
>>> z = blockwise(operator.mul, 'ij', x, 'i', y, 'j', dtype='f8') # doctest: +SKIP
z = x.T
>>> z = blockwise(np.transpose, 'ji', x, 'ij', dtype=x.dtype) # doctest: +SKIP
The transpose case above is illustrative because it does same transposition both on each inmemory block by calling
np.transpose
and on the order of the blocks themselves, by switching the order of the indexij > ji
.We can compose these same patterns with more variables and more complex inmemory functions
z = X + Y.T
>>> z = blockwise(lambda x, y: x + y.T, 'ij', x, 'ij', y, 'ji', dtype='f8') # doctest: +SKIP
Any index, like
i
missing from the output index is interpreted as a contraction (note that this differs from Einstein convention; repeated indices do not imply contraction.) In the case of a contraction the passed function should expect an iterable of blocks on any array that holds that index. To receive arrays concatenated along contracted dimensions instead passconcatenate=True
.Inner product multiplying x by y, two 1d vectors
>>> def sequence_dot(x_blocks, y_blocks): ... result = 0 ... for x, y in zip(x_blocks, y_blocks): ... result += x.dot(y) ... return result
>>> z = blockwise(sequence_dot, '', x, 'i', y, 'i', dtype='f8') # doctest: +SKIP
Add new singlechunk dimensions with the
new_axes=
keyword, including the length of the new dimension. New dimensions will always be in a single chunk.>>> def f(x): ... return x[:, None] * np.ones((1, 5))
>>> z = blockwise(f, 'az', x, 'a', new_axes={'z': 5}, dtype=x.dtype) # doctest: +SKIP
New dimensions can also be multichunk by specifying a tuple of chunk sizes. This has limited utility as is (because the chunks are all the same), but the resulting graph can be modified to achieve more useful results (see
da.map_blocks
).>>> z = blockwise(f, 'az', x, 'a', new_axes={'z': (5, 5)}, dtype=x.dtype) # doctest: +SKIP
If the applied function changes the size of each chunk you can specify this with a
adjust_chunks={...}
dictionary holding a function for each index that modifies the dimension size in that index.>>> def double(x): ... return np.concatenate([x, x])
>>> y = blockwise(double, 'ij', x, 'ij', ... adjust_chunks={'i': lambda n: 2 * n}, dtype=x.dtype) # doctest: +SKIP
Include literals by indexing with None
>>> y = blockwise(add, 'ij', x, 'ij', 1234, None, dtype=x.dtype) # doctest: +SKIP

dask.array.
broadcast_arrays
(*args, **kwargs)¶ Broadcast any number of arrays against each other.
This docstring was copied from numpy.broadcast_arrays.
Some inconsistencies with the Dask version may exist.
Parameters:  `*args` : array_likes
The arrays to broadcast.
 subok : bool, optional
If True, then subclasses will be passedthrough, otherwise the returned arrays will be forced to be a baseclass array (default).
Returns:  broadcasted : list of arrays
These arrays are views on the original arrays. They are typically not contiguous. Furthermore, more than one element of a broadcasted array may refer to a single memory location. If you need to write to the arrays, make copies first.
Examples
>>> x = np.array([[1,2,3]]) # doctest: +SKIP >>> y = np.array([[4],[5]]) # doctest: +SKIP >>> np.broadcast_arrays(x, y) # doctest: +SKIP [array([[1, 2, 3], [1, 2, 3]]), array([[4, 4, 4], [5, 5, 5]])]
Here is a useful idiom for getting contiguous copies instead of noncontiguous views.
>>> [np.array(a) for a in np.broadcast_arrays(x, y)] # doctest: +SKIP [array([[1, 2, 3], [1, 2, 3]]), array([[4, 4, 4], [5, 5, 5]])]

dask.array.
broadcast_to
(x, shape, chunks=None)¶ Broadcast an array to a new shape.
Parameters:  x : array_like
The array to broadcast.
 shape : tuple
The shape of the desired array.
 chunks : tuple, optional
If provided, then the result will use these chunks instead of the same chunks as the source array. Setting chunks explicitly as part of broadcast_to is more efficient than rechunking afterwards. Chunks are only allowed to differ from the original shape along dimensions that are new on the result or have size 1 the input array.
Returns:  broadcast : dask array
See also

dask.array.
coarsen
(reduction, x, axes, trim_excess=False) Coarsen array by applying reduction to fixed size neighborhoods
Parameters:  reduction: function
Function like np.sum, np.mean, etc…
 x: np.ndarray
Array to be coarsened
 axes: dict
Mapping of axis to coarsening factor
Examples
>>> x = np.array([1, 2, 3, 4, 5, 6]) >>> coarsen(np.sum, x, {0: 2}) array([ 3, 7, 11]) >>> coarsen(np.max, x, {0: 3}) array([3, 6])
Provide dictionary of scale per dimension
>>> x = np.arange(24).reshape((4, 6)) >>> x array([[ 0, 1, 2, 3, 4, 5], [ 6, 7, 8, 9, 10, 11], [12, 13, 14, 15, 16, 17], [18, 19, 20, 21, 22, 23]])
>>> coarsen(np.min, x, {0: 2, 1: 3}) array([[ 0, 3], [12, 15]])
You must avoid excess elements explicitly
>>> x = np.array([1, 2, 3, 4, 5, 6, 7, 8]) >>> coarsen(np.min, x, {0: 3}, trim_excess=True) array([1, 4])

dask.array.
ceil
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Return the ceiling of the input, elementwise.
The ceil of the scalar x is the smallest integer i, such that i >= x. It is often denoted as \(\lceil x \rceil\).
Parameters:  x : array_like
Input data.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray or scalar
The ceiling of each element in x, with float dtype. This is a scalar if x is a scalar.
Examples
>>> a = np.array([1.7, 1.5, 0.2, 0.2, 1.5, 1.7, 2.0]) # doctest: +SKIP >>> np.ceil(a) # doctest: +SKIP array([1., 1., 0., 1., 2., 2., 2.])

dask.array.
choose
(a, choices)¶ Construct an array from an index array and a set of arrays to choose from.
This docstring was copied from numpy.choose.
Some inconsistencies with the Dask version may exist.
First of all, if confused or uncertain, definitely look at the Examples  in its full generality, this function is less simple than it might seem from the following code description (below ndi = numpy.lib.index_tricks):
np.choose(a,c) == np.array([c[a[I]][I] for I in ndi.ndindex(a.shape)])
.But this omits some subtleties. Here is a fully general summary:
Given an “index” array (a) of integers and a sequence of n arrays (choices), a and each choice array are first broadcast, as necessary, to arrays of a common shape; calling these Ba and Bchoices[i], i = 0,…,n1 we have that, necessarily,
Ba.shape == Bchoices[i].shape
for each i. Then, a new array with shapeBa.shape
is created as follows: if
mode=raise
(the default), then, first of all, each element of a (and thus Ba) must be in the range [0, n1]; now, suppose that i (in that range) is the value at the (j0, j1, …, jm) position in Ba  then the value at the same position in the new array is the value in Bchoices[i] at that same position;  if
mode=wrap
, values in a (and thus Ba) may be any (signed) integer; modular arithmetic is used to map integers outside the range [0, n1] back into that range; and then the new array is constructed as above;  if
mode=clip
, values in a (and thus Ba) may be any (signed) integer; negative integers are mapped to 0; values greater than n1 are mapped to n1; and then the new array is constructed as above.
Parameters:  a : int array
This array must contain integers in [0, n1], where n is the number of choices, unless
mode=wrap
ormode=clip
, in which cases any integers are permissible. choices : sequence of arrays
Choice arrays. a and all of the choices must be broadcastable to the same shape. If choices is itself an array (not recommended), then its outermost dimension (i.e., the one corresponding to
choices.shape[0]
) is taken as defining the “sequence”. out : array, optional (Not supported in Dask)
If provided, the result will be inserted into this array. It should be of the appropriate shape and dtype.
 mode : {‘raise’ (default), ‘wrap’, ‘clip’}, optional (Not supported in Dask)
Specifies how indices outside [0, n1] will be treated:
 ‘raise’ : an exception is raised
 ‘wrap’ : value becomes value mod n
 ‘clip’ : values < 0 are mapped to 0, values > n1 are mapped to n1
Returns:  merged_array : array
The merged result.
Raises:  ValueError: shape mismatch
If a and each choice array are not all broadcastable to the same shape.
See also
ndarray.choose
 equivalent method
Notes
To reduce the chance of misinterpretation, even though the following “abuse” is nominally supported, choices should neither be, nor be thought of as, a single array, i.e., the outermost sequencelike container should be either a list or a tuple.
Examples
>>> choices = [[0, 1, 2, 3], [10, 11, 12, 13], # doctest: +SKIP ... [20, 21, 22, 23], [30, 31, 32, 33]] >>> np.choose([2, 3, 1, 0], choices # doctest: +SKIP ... # the first element of the result will be the first element of the ... # third (2+1) "array" in choices, namely, 20; the second element ... # will be the second element of the fourth (3+1) choice array, i.e., ... # 31, etc. ... ) array([20, 31, 12, 3]) >>> np.choose([2, 4, 1, 0], choices, mode='clip') # 4 goes to 3 (41) # doctest: +SKIP array([20, 31, 12, 3]) >>> # because there are 4 choice arrays >>> np.choose([2, 4, 1, 0], choices, mode='wrap') # 4 goes to (4 mod 4) # doctest: +SKIP array([20, 1, 12, 3]) >>> # i.e., 0
A couple examples illustrating how choose broadcasts:
>>> a = [[1, 0, 1], [0, 1, 0], [1, 0, 1]] # doctest: +SKIP >>> choices = [10, 10] # doctest: +SKIP >>> np.choose(a, choices) # doctest: +SKIP array([[ 10, 10, 10], [10, 10, 10], [ 10, 10, 10]])
>>> # With thanks to Anne Archibald >>> a = np.array([0, 1]).reshape((2,1,1)) # doctest: +SKIP >>> c1 = np.array([1, 2, 3]).reshape((1,3,1)) # doctest: +SKIP >>> c2 = np.array([1, 2, 3, 4, 5]).reshape((1,1,5)) # doctest: +SKIP >>> np.choose(a, (c1, c2)) # result is 2x3x5, res[0,:,:]=c1, res[1,:,:]=c2 # doctest: +SKIP array([[[ 1, 1, 1, 1, 1], [ 2, 2, 2, 2, 2], [ 3, 3, 3, 3, 3]], [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]])
 if

dask.array.
clip
(*args, **kwargs)¶ Clip (limit) the values in an array.
Given an interval, values outside the interval are clipped to the interval edges. For example, if an interval of
[0, 1]
is specified, values smaller than 0 become 0, and values larger than 1 become 1.Parameters:  a : array_like
Array containing elements to clip.
 a_min : scalar or array_like or None
Minimum value. If None, clipping is not performed on lower interval edge. Not more than one of a_min and a_max may be None.
 a_max : scalar or array_like or None
Maximum value. If None, clipping is not performed on upper interval edge. Not more than one of a_min and a_max may be None. If a_min or a_max are array_like, then the three arrays will be broadcasted to match their shapes.
 out : ndarray, optional
The results will be placed in this array. It may be the input array for inplace clipping. out must be of the right shape to hold the output. Its type is preserved.
Returns:  clipped_array : ndarray
An array with the elements of a, but where values < a_min are replaced with a_min, and those > a_max with a_max.
See also
numpy.doc.ufuncs
 Section “Output arguments”
Examples
>>> a = np.arange(10) # doctest: +SKIP >>> np.clip(a, 1, 8) # doctest: +SKIP array([1, 1, 2, 3, 4, 5, 6, 7, 8, 8]) >>> a # doctest: +SKIP array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> np.clip(a, 3, 6, out=a) # doctest: +SKIP array([3, 3, 3, 3, 4, 5, 6, 6, 6, 6]) >>> a = np.arange(10) # doctest: +SKIP >>> a # doctest: +SKIP array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) >>> np.clip(a, [3, 4, 1, 1, 1, 4, 4, 4, 4, 4], 8) # doctest: +SKIP array([3, 4, 2, 3, 4, 5, 6, 7, 8, 8])

dask.array.
compress
(condition, a, axis=None)¶ Return selected slices of an array along given axis.
This docstring was copied from numpy.compress.
Some inconsistencies with the Dask version may exist.
When working along a given axis, a slice along that axis is returned in output for each index where condition evaluates to True. When working on a 1D array, compress is equivalent to extract.
Parameters:  condition : 1D array of bools
Array that selects which entries to return. If len(condition) is less than the size of a along the given axis, then output is truncated to the length of the condition array.
 a : array_like
Array from which to extract a part.
 axis : int, optional
Axis along which to take slices. If None (default), work on the flattened array.
 out : ndarray, optional (Not supported in Dask)
Output array. Its type is preserved and it must be of the right shape to hold the output.
Returns:  compressed_array : ndarray
A copy of a without the slices along axis for which condition is false.
See also
take
,choose
,diag
,diagonal
,select
ndarray.compress
 Equivalent method in ndarray
np.extract
 Equivalent method when working on 1D arrays
numpy.doc.ufuncs
 Section “Output arguments”
Examples
>>> a = np.array([[1, 2], [3, 4], [5, 6]]) # doctest: +SKIP >>> a # doctest: +SKIP array([[1, 2], [3, 4], [5, 6]]) >>> np.compress([0, 1], a, axis=0) # doctest: +SKIP array([[3, 4]]) >>> np.compress([False, True, True], a, axis=0) # doctest: +SKIP array([[3, 4], [5, 6]]) >>> np.compress([False, True], a, axis=1) # doctest: +SKIP array([[2], [4], [6]])
Working on the flattened array does not return slices along an axis but selects elements.
>>> np.compress([False, True], a) # doctest: +SKIP array([2])

dask.array.
concatenate
(seq, axis=0, allow_unknown_chunksizes=False) Concatenate arrays along an existing axis
Given a sequence of dask Arrays form a new dask Array by stacking them along an existing dimension (axis=0 by default)
Parameters:  seq: list of dask.arrays
 axis: int
Dimension along which to align all of the arrays
 allow_unknown_chunksizes: bool
Allow unknown chunksizes, such as come from converting from dask dataframes. Dask.array is unable to verify that chunks line up. If data comes from differently aligned sources then this can cause unexpected results.
See also
Examples
Create slices
>>> import dask.array as da >>> import numpy as np
>>> data = [from_array(np.ones((4, 4)), chunks=(2, 2)) ... for i in range(3)]
>>> x = da.concatenate(data, axis=0) >>> x.shape (12, 4)
>>> da.concatenate(data, axis=1).shape (4, 12)
Result is a new dask Array

dask.array.
conj
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Return the complex conjugate, elementwise.
The complex conjugate of a complex number is obtained by changing the sign of its imaginary part.
Parameters:  x : array_like
Input value.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray
The complex conjugate of x, with same dtype as y. This is a scalar if x is a scalar.
Examples
>>> np.conjugate(1+2j) # doctest: +SKIP (12j)
>>> x = np.eye(2) + 1j * np.eye(2) # doctest: +SKIP >>> np.conjugate(x) # doctest: +SKIP array([[ 1.1.j, 0.0.j], [ 0.0.j, 1.1.j]])

dask.array.
copysign
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Change the sign of x1 to that of x2, elementwise.
If both arguments are arrays or sequences, they have to be of the same length. If x2 is a scalar, its sign will be copied to all elements of x1.
Parameters:  x1 : array_like
Values to change the sign of.
 x2 : array_like
The sign of x2 is copied to x1.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
The values of x1 with the sign of x2. This is a scalar if both x1 and x2 are scalars.
Examples
>>> np.copysign(1.3, 1) # doctest: +SKIP 1.3 >>> 1/np.copysign(0, 1) # doctest: +SKIP inf >>> 1/np.copysign(0, 1) # doctest: +SKIP inf
>>> np.copysign([1, 0, 1], 1.1) # doctest: +SKIP array([1., 0., 1.]) >>> np.copysign([1, 0, 1], np.arange(3)1) # doctest: +SKIP array([1., 0., 1.])

dask.array.
corrcoef
(x, y=None, rowvar=1)¶ Return Pearson productmoment correlation coefficients.
This docstring was copied from numpy.corrcoef.
Some inconsistencies with the Dask version may exist.
Please refer to the documentation for cov for more detail. The relationship between the correlation coefficient matrix, R, and the covariance matrix, C, is
\[R_{ij} = \frac{ C_{ij} } { \sqrt{ C_{ii} * C_{jj} } }\]The values of R are between 1 and 1, inclusive.
Parameters:  x : array_like
A 1D or 2D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables. Also see rowvar below.
 y : array_like, optional
An additional set of variables and observations. y has the same shape as x.
 rowvar : bool, optional
If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
 bias : _NoValue, optional (Not supported in Dask)
Has no effect, do not use.
Deprecated since version 1.10.0.
 ddof : _NoValue, optional (Not supported in Dask)
Has no effect, do not use.
Deprecated since version 1.10.0.
Returns:  R : ndarray
The correlation coefficient matrix of the variables.
See also
cov
 Covariance matrix
Notes
Due to floating point rounding the resulting array may not be Hermitian, the diagonal elements may not be 1, and the elements may not satisfy the inequality abs(a) <= 1. The real and imaginary parts are clipped to the interval [1, 1] in an attempt to improve on that situation but is not much help in the complex case.
This function accepts but discards arguments bias and ddof. This is for backwards compatibility with previous versions of this function. These arguments had no effect on the return values of the function and can be safely ignored in this and previous versions of numpy.

dask.array.
cos
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Cosine elementwise.
Parameters:  x : array_like
Input array in radians.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray
The corresponding cosine values. This is a scalar if x is a scalar.
Notes
If out is provided, the function writes the result into it, and returns a reference to out. (See Examples)
References
M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York, NY: Dover, 1972.
Examples
>>> np.cos(np.array([0, np.pi/2, np.pi])) # doctest: +SKIP array([ 1.00000000e+00, 6.12303177e17, 1.00000000e+00]) >>> >>> # Example of providing the optional output parameter >>> out2 = np.cos([0.1], out1) # doctest: +SKIP >>> out2 is out1 # doctest: +SKIP True >>> >>> # Example of ValueError due to provision of shape mismatched `out` >>> np.cos(np.zeros((3,3)),np.zeros((2,2))) # doctest: +SKIP Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: operands could not be broadcast together with shapes (3,3) (2,2)

dask.array.
cosh
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Hyperbolic cosine, elementwise.
Equivalent to
1/2 * (np.exp(x) + np.exp(x))
andnp.cos(1j*x)
.Parameters:  x : array_like
Input array.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Output array of same shape as x. This is a scalar if x is a scalar.
Examples
>>> np.cosh(0) # doctest: +SKIP 1.0
The hyperbolic cosine describes the shape of a hanging cable:
>>> import matplotlib.pyplot as plt # doctest: +SKIP >>> x = np.linspace(4, 4, 1000) # doctest: +SKIP >>> plt.plot(x, np.cosh(x)) # doctest: +SKIP >>> plt.show() # doctest: +SKIP

dask.array.
count_nonzero
(a, axis=None)¶ Counts the number of nonzero values in the array
a
.This docstring was copied from numpy.count_nonzero.
Some inconsistencies with the Dask version may exist.
The word “nonzero” is in reference to the Python 2.x builtin method
__nonzero__()
(renamed__bool__()
in Python 3.x) of Python objects that tests an object’s “truthfulness”. For example, any number is considered truthful if it is nonzero, whereas any string is considered truthful if it is not the empty string. Thus, this function (recursively) counts how many elements ina
(and in subarrays thereof) have their__nonzero__()
or__bool__()
method evaluated toTrue
.Parameters:  a : array_like
The array for which to count nonzeros.
 axis : int or tuple, optional
Axis or tuple of axes along which to count nonzeros. Default is None, meaning that nonzeros will be counted along a flattened version of
a
.New in version 1.12.0.
Returns:  count : int or array of int
Number of nonzero values in the array along a given axis. Otherwise, the total number of nonzero values in the array is returned.
See also
nonzero
 Return the coordinates of all the nonzero values.
Examples
>>> np.count_nonzero(np.eye(4)) # doctest: +SKIP 4 >>> np.count_nonzero([[0,1,7,0,0],[3,0,0,2,19]]) # doctest: +SKIP 5 >>> np.count_nonzero([[0,1,7,0,0],[3,0,0,2,19]], axis=0) # doctest: +SKIP array([1, 1, 1, 1, 1]) >>> np.count_nonzero([[0,1,7,0,0],[3,0,0,2,19]], axis=1) # doctest: +SKIP array([2, 3])

dask.array.
cov
(m, y=None, rowvar=1, bias=0, ddof=None)¶ Estimate a covariance matrix, given data and weights.
This docstring was copied from numpy.cov.
Some inconsistencies with the Dask version may exist.
Covariance indicates the level to which two variables vary together. If we examine Ndimensional samples, \(X = [x_1, x_2, ... x_N]^T\), then the covariance matrix element \(C_{ij}\) is the covariance of \(x_i\) and \(x_j\). The element \(C_{ii}\) is the variance of \(x_i\).
See the notes for an outline of the algorithm.
Parameters:  m : array_like
A 1D or 2D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables. Also see rowvar below.
 y : array_like, optional
An additional set of variables and observations. y has the same form as that of m.
 rowvar : bool, optional
If rowvar is True (default), then each row represents a variable, with observations in the columns. Otherwise, the relationship is transposed: each column represents a variable, while the rows contain observations.
 bias : bool, optional
Default normalization (False) is by
(N  1)
, whereN
is the number of observations given (unbiased estimate). If bias is True, then normalization is byN
. These values can be overridden by using the keywordddof
in numpy versions >= 1.5. ddof : int, optional
If not
None
the default value implied by bias is overridden. Note thatddof=1
will return the unbiased estimate, even if both fweights and aweights are specified, andddof=0
will return the simple average. See the notes for the details. The default value isNone
.New in version 1.5.
 fweights : array_like, int, optional (Not supported in Dask)
1D array of integer frequency weights; the number of times each observation vector should be repeated.
New in version 1.10.
 aweights : array_like, optional (Not supported in Dask)
1D array of observation vector weights. These relative weights are typically large for observations considered “important” and smaller for observations considered less “important”. If
ddof=0
the array of weights can be used to assign probabilities to observation vectors.New in version 1.10.
Returns:  out : ndarray
The covariance matrix of the variables.
See also
corrcoef
 Normalized covariance matrix
Notes
Assume that the observations are in the columns of the observation array m and let
f = fweights
anda = aweights
for brevity. The steps to compute the weighted covariance are as follows:>>> w = f * a >>> v1 = np.sum(w) >>> v2 = np.sum(w * a) >>> m = np.sum(m * w, axis=1, keepdims=True) / v1 >>> cov = np.dot(m * w, m.T) * v1 / (v1**2  ddof * v2)
Note that when
a == 1
, the normalization factorv1 / (v1**2  ddof * v2)
goes over to1 / (np.sum(f)  ddof)
as it should.Examples
Consider two variables, \(x_0\) and \(x_1\), which correlate perfectly, but in opposite directions:
>>> x = np.array([[0, 2], [1, 1], [2, 0]]).T # doctest: +SKIP >>> x # doctest: +SKIP array([[0, 1, 2], [2, 1, 0]])
Note how \(x_0\) increases while \(x_1\) decreases. The covariance matrix shows this clearly:
>>> np.cov(x) # doctest: +SKIP array([[ 1., 1.], [1., 1.]])
Note that element \(C_{0,1}\), which shows the correlation between \(x_0\) and \(x_1\), is negative.
Further, note how x and y are combined:
>>> x = [2.1, 1, 4.3] # doctest: +SKIP >>> y = [3, 1.1, 0.12] # doctest: +SKIP >>> X = np.stack((x, y), axis=0) # doctest: +SKIP >>> print(np.cov(X)) # doctest: +SKIP [[ 11.71 4.286 ] [ 4.286 2.14413333]] >>> print(np.cov(x, y)) # doctest: +SKIP [[ 11.71 4.286 ] [ 4.286 2.14413333]] >>> print(np.cov(x)) # doctest: +SKIP 11.71

dask.array.
cumprod
(a, axis=None, dtype=None, out=None)¶ Return the cumulative product of elements along a given axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
Axis along which the cumulative product is computed. By default the input is flattened.
 dtype : dtype, optional
Type of the returned array, as well as of the accumulator in which the elements are multiplied. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used instead.
 out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output but the type of the resulting values will be cast if necessary.
Returns:  cumprod : ndarray
A new array holding the result is returned unless out is specified, in which case a reference to out is returned.
See also
numpy.doc.ufuncs
 Section “Output arguments”
Notes
Arithmetic is modular when using integer types, and no error is raised on overflow.
Examples
>>> a = np.array([1,2,3]) >>> np.cumprod(a) # intermediate results 1, 1*2 ... # total product 1*2*3 = 6 array([1, 2, 6]) >>> a = np.array([[1, 2, 3], [4, 5, 6]]) >>> np.cumprod(a, dtype=float) # specify type of output array([ 1., 2., 6., 24., 120., 720.])
The cumulative product for each column (i.e., over the rows) of a:
>>> np.cumprod(a, axis=0) array([[ 1, 2, 3], [ 4, 10, 18]])
The cumulative product for each row (i.e. over the columns) of a:
>>> np.cumprod(a,axis=1) array([[ 1, 2, 6], [ 4, 20, 120]])

dask.array.
cumsum
(a, axis=None, dtype=None, out=None)¶ Return the cumulative sum of the elements along a given axis.
Parameters:  a : array_like
Input array.
 axis : int, optional
Axis along which the cumulative sum is computed. The default (None) is to compute the cumsum over the flattened array.
 dtype : dtype, optional
Type of the returned array and of the accumulator in which the elements are summed. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used.
 out : ndarray, optional
Alternative output array in which to place the result. It must have the same shape and buffer length as the expected output but the type will be cast if necessary. See doc.ufuncs (Section “Output arguments”) for more details.
Returns:  cumsum_along_axis : ndarray.
A new array holding the result is returned unless out is specified, in which case a reference to out is returned. The result has the same size as a, and the same shape as a if axis is not None or a is a 1d array.
See also
Notes
Arithmetic is modular when using integer types, and no error is raised on overflow.
Examples
>>> a = np.array([[1,2,3], [4,5,6]]) >>> a array([[1, 2, 3], [4, 5, 6]]) >>> np.cumsum(a) array([ 1, 3, 6, 10, 15, 21]) >>> np.cumsum(a, dtype=float) # specifies type of output value(s) array([ 1., 3., 6., 10., 15., 21.])
>>> np.cumsum(a,axis=0) # sum over rows for each of the 3 columns array([[1, 2, 3], [5, 7, 9]]) >>> np.cumsum(a,axis=1) # sum over columns for each of the 2 rows array([[ 1, 3, 6], [ 4, 9, 15]])

dask.array.
deg2rad
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Convert angles from degrees to radians.
Parameters:  x : array_like
Angles in degrees.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray
The corresponding angle in radians. This is a scalar if x is a scalar.
See also
rad2deg
 Convert angles from radians to degrees.
unwrap
 Remove large jumps in angle by wrapping.
Notes
New in version 1.3.0.
deg2rad(x)
isx * pi / 180
.Examples
>>> np.deg2rad(180) # doctest: +SKIP 3.1415926535897931

dask.array.
degrees
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Convert angles from radians to degrees.
Parameters:  x : array_like
Input array in radians.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray of floats
The corresponding degree values; if out was supplied this is a reference to it. This is a scalar if x is a scalar.
See also
rad2deg
 equivalent function
Examples
Convert a radian array to degrees
>>> rad = np.arange(12.)*np.pi/6 # doctest: +SKIP >>> np.degrees(rad) # doctest: +SKIP array([ 0., 30., 60., 90., 120., 150., 180., 210., 240., 270., 300., 330.])
>>> out = np.zeros((rad.shape)) # doctest: +SKIP >>> r = degrees(rad, out) # doctest: +SKIP >>> np.all(r == out) # doctest: +SKIP True

dask.array.
diag
(v)¶ Extract a diagonal or construct a diagonal array.
This docstring was copied from numpy.diag.
Some inconsistencies with the Dask version may exist.
See the more detailed documentation for
numpy.diagonal
if you use this function to extract a diagonal and wish to write to the resulting array; whether it returns a copy or a view depends on what version of numpy you are using.Parameters:  v : array_like
If v is a 2D array, return a copy of its kth diagonal. If v is a 1D array, return a 2D array with v on the kth diagonal.
 k : int, optional (Not supported in Dask)
Diagonal in question. The default is 0. Use k>0 for diagonals above the main diagonal, and k<0 for diagonals below the main diagonal.
Returns:  out : ndarray
The extracted diagonal or constructed diagonal array.
See also
Examples
>>> x = np.arange(9).reshape((3,3)) # doctest: +SKIP >>> x # doctest: +SKIP array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
>>> np.diag(x) # doctest: +SKIP array([0, 4, 8]) >>> np.diag(x, k=1) # doctest: +SKIP array([1, 5]) >>> np.diag(x, k=1) # doctest: +SKIP array([3, 7])
>>> np.diag(np.diag(x)) # doctest: +SKIP array([[0, 0, 0], [0, 4, 0], [0, 0, 8]])

dask.array.
diagonal
(a, offset=0, axis1=0, axis2=1)¶ Return specified diagonals.
This docstring was copied from numpy.diagonal.
Some inconsistencies with the Dask version may exist.
If a is 2D, returns the diagonal of a with the given offset, i.e., the collection of elements of the form
a[i, i+offset]
. If a has more than two dimensions, then the axes specified by axis1 and axis2 are used to determine the 2D subarray whose diagonal is returned. The shape of the resulting array can be determined by removing axis1 and axis2 and appending an index to the right equal to the size of the resulting diagonals.In versions of NumPy prior to 1.7, this function always returned a new, independent array containing a copy of the values in the diagonal.
In NumPy 1.7 and 1.8, it continues to return a copy of the diagonal, but depending on this fact is deprecated. Writing to the resulting array continues to work as it used to, but a FutureWarning is issued.
Starting in NumPy 1.9 it returns a readonly view on the original array. Attempting to write to the resulting array will produce an error.
In some future release, it will return a read/write view and writing to the returned array will alter your original array. The returned array will have the same type as the input array.
If you don’t write to the array returned by this function, then you can just ignore all of the above.
If you depend on the current behavior, then we suggest copying the returned array explicitly, i.e., use
np.diagonal(a).copy()
instead of justnp.diagonal(a)
. This will work with both past and future versions of NumPy.Parameters:  a : array_like
Array from which the diagonals are taken.
 offset : int, optional
Offset of the diagonal from the main diagonal. Can be positive or negative. Defaults to main diagonal (0).
 axis1 : int, optional
Axis to be used as the first axis of the 2D subarrays from which the diagonals should be taken. Defaults to first axis (0).
 axis2 : int, optional
Axis to be used as the second axis of the 2D subarrays from which the diagonals should be taken. Defaults to second axis (1).
Returns:  array_of_diagonals : ndarray
If a is 2D, then a 1D array containing the diagonal and of the same type as a is returned unless a is a matrix, in which case a 1D array rather than a (2D) matrix is returned in order to maintain backward compatibility.
If
a.ndim > 2
, then the dimensions specified by axis1 and axis2 are removed, and a new axis inserted at the end corresponding to the diagonal.
Raises:  ValueError
If the dimension of a is less than 2.
See also
diag
 MATLAB workalike for 1D and 2D arrays.
diagflat
 Create diagonal arrays.
trace
 Sum along diagonals.
Examples
>>> a = np.arange(4).reshape(2,2) # doctest: +SKIP >>> a # doctest: +SKIP array([[0, 1], [2, 3]]) >>> a.diagonal() # doctest: +SKIP array([0, 3]) >>> a.diagonal(1) # doctest: +SKIP array([1])
A 3D example:
>>> a = np.arange(8).reshape(2,2,2); a # doctest: +SKIP array([[[0, 1], [2, 3]], [[4, 5], [6, 7]]]) >>> a.diagonal(0, # Main diagonals of two arrays created by skipping # doctest: +SKIP ... 0, # across the outer(left)most axis last and ... 1) # the "middle" (row) axis first. array([[0, 6], [1, 7]])
The subarrays whose main diagonals we just obtained; note that each corresponds to fixing the rightmost (column) axis, and that the diagonals are “packed” in rows.
>>> a[:,:,0] # main diagonal is [0 6] # doctest: +SKIP array([[0, 2], [4, 6]]) >>> a[:,:,1] # main diagonal is [1 7] # doctest: +SKIP array([[1, 3], [5, 7]])

dask.array.
diff
(a, n=1, axis=1)¶ Calculate the nth discrete difference along the given axis.
This docstring was copied from numpy.diff.
Some inconsistencies with the Dask version may exist.
The first difference is given by
out[n] = a[n+1]  a[n]
along the given axis, higher differences are calculated by using diff recursively.Parameters:  a : array_like
Input array
 n : int, optional
The number of times values are differenced. If zero, the input is returned asis.
 axis : int, optional
The axis along which the difference is taken, default is the last axis.
 prepend, append : array_like, optional
Values to prepend or append to “a” along axis prior to performing the difference. Scalar values are expanded to arrays with length 1 in the direction of axis and the shape of the input array in along all other axes. Otherwise the dimension and shape must match “a” except along axis.
Returns:  diff : ndarray
The nth differences. The shape of the output is the same as a except along axis where the dimension is smaller by n. The type of the output is the same as the type of the difference between any two elements of a. This is the same as the type of a in most cases. A notable exception is datetime64, which results in a timedelta64 output array.
Notes
Type is preserved for boolean arrays, so the result will contain False when consecutive elements are the same and True when they differ.
For unsigned integer arrays, the results will also be unsigned. This should not be surprising, as the result is consistent with calculating the difference directly:
>>> u8_arr = np.array([1, 0], dtype=np.uint8) # doctest: +SKIP >>> np.diff(u8_arr) # doctest: +SKIP array([255], dtype=uint8) >>> u8_arr[1,...]  u8_arr[0,...] # doctest: +SKIP array(255, np.uint8)
If this is not desirable, then the array should be cast to a larger integer type first:
>>> i16_arr = u8_arr.astype(np.int16) # doctest: +SKIP >>> np.diff(i16_arr) # doctest: +SKIP array([1], dtype=int16)
Examples
>>> x = np.array([1, 2, 4, 7, 0]) # doctest: +SKIP >>> np.diff(x) # doctest: +SKIP array([ 1, 2, 3, 7]) >>> np.diff(x, n=2) # doctest: +SKIP array([ 1, 1, 10])
>>> x = np.array([[1, 3, 6, 10], [0, 5, 6, 8]]) # doctest: +SKIP >>> np.diff(x) # doctest: +SKIP array([[2, 3, 4], [5, 1, 2]]) >>> np.diff(x, axis=0) # doctest: +SKIP array([[1, 2, 0, 2]])
>>> x = np.arange('10661013', '10661016', dtype=np.datetime64) # doctest: +SKIP >>> np.diff(x) # doctest: +SKIP array([1, 1], dtype='timedelta64[D]')

dask.array.
digitize
(a, bins, right=False)¶ Return the indices of the bins to which each value in input array belongs.
This docstring was copied from numpy.digitize.
Some inconsistencies with the Dask version may exist.
right order of bins returned index i satisfies False
increasing bins[i1] <= x < bins[i]
True
increasing bins[i1] < x <= bins[i]
False
decreasing bins[i1] > x >= bins[i]
True
decreasing bins[i1] >= x > bins[i]
If values in x are beyond the bounds of bins, 0 or
len(bins)
is returned as appropriate.Parameters:  x : array_like (Not supported in Dask)
Input array to be binned. Prior to NumPy 1.10.0, this array had to be 1dimensional, but can now have any shape.
 bins : array_like
Array of bins. It has to be 1dimensional and monotonic.
 right : bool, optional
Indicating whether the intervals include the right or the left bin edge. Default behavior is (right==False) indicating that the interval does not include the right edge. The left bin end is open in this case, i.e., bins[i1] <= x < bins[i] is the default behavior for monotonically increasing bins.
Returns:  indices : ndarray of ints
Output array of indices, of same shape as x.
Raises:  ValueError
If bins is not monotonic.
 TypeError
If the type of the input is complex.
Notes
If values in x are such that they fall outside the bin range, attempting to index bins with the indices that digitize returns will result in an IndexError.
New in version 1.10.0.
np.digitize is implemented in terms of np.searchsorted. This means that a binary search is used to bin the values, which scales much better for larger number of bins than the previous linear search. It also removes the requirement for the input array to be 1dimensional.
For monotonically _increasing_ bins, the following are equivalent:
np.digitize(x, bins, right=True) np.searchsorted(bins, x, side='left')
Note that as the order of the arguments are reversed, the side must be too. The searchsorted call is marginally faster, as it does not do any monotonicity checks. Perhaps more importantly, it supports all dtypes.
Examples
>>> x = np.array([0.2, 6.4, 3.0, 1.6]) # doctest: +SKIP >>> bins = np.array([0.0, 1.0, 2.5, 4.0, 10.0]) # doctest: +SKIP >>> inds = np.digitize(x, bins) # doctest: +SKIP >>> inds # doctest: +SKIP array([1, 4, 3, 2]) >>> for n in range(x.size): # doctest: +SKIP ... print(bins[inds[n]1], "<=", x[n], "<", bins[inds[n]]) ... 0.0 <= 0.2 < 1.0 4.0 <= 6.4 < 10.0 2.5 <= 3.0 < 4.0 1.0 <= 1.6 < 2.5
>>> x = np.array([1.2, 10.0, 12.4, 15.5, 20.]) # doctest: +SKIP >>> bins = np.array([0, 5, 10, 15, 20]) # doctest: +SKIP >>> np.digitize(x,bins,right=True) # doctest: +SKIP array([1, 2, 3, 4, 4]) >>> np.digitize(x,bins,right=False) # doctest: +SKIP array([1, 3, 3, 4, 5])

dask.array.
dot
(a, b, out=None)¶ This docstring was copied from numpy.dot.
Some inconsistencies with the Dask version may exist.
Dot product of two arrays. Specifically,
If both a and b are 1D arrays, it is inner product of vectors (without complex conjugation).
If both a and b are 2D arrays, it is matrix multiplication, but using
matmul()
ora @ b
is preferred.If either a or b is 0D (scalar), it is equivalent to
multiply()
and usingnumpy.multiply(a, b)
ora * b
is preferred.If a is an ND array and b is a 1D array, it is a sum product over the last axis of a and b.
If a is an ND array and b is an MD array (where
M>=2
), it is a sum product over the last axis of a and the secondtolast axis of b:dot(a, b)[i,j,k,m] = sum(a[i,j,:] * b[k,:,m])
Parameters:  a : array_like
First argument.
 b : array_like
Second argument.
 out : ndarray, optional
Output argument. This must have the exact kind that would be returned if it was not used. In particular, it must have the right type, must be Ccontiguous, and its dtype must be the dtype that would be returned for dot(a,b). This is a performance feature. Therefore, if these conditions are not met, an exception is raised, instead of attempting to be flexible.
Returns:  output : ndarray
Returns the dot product of a and b. If a and b are both scalars or both 1D arrays then a scalar is returned; otherwise an array is returned. If out is given, then it is returned.
Raises:  ValueError
If the last dimension of a is not the same size as the secondtolast dimension of b.
See also
Examples
>>> np.dot(3, 4) # doctest: +SKIP 12
Neither argument is complexconjugated:
>>> np.dot([2j, 3j], [2j, 3j]) # doctest: +SKIP (13+0j)
For 2D arrays it is the matrix product:
>>> a = [[1, 0], [0, 1]] # doctest: +SKIP >>> b = [[4, 1], [2, 2]] # doctest: +SKIP >>> np.dot(a, b) # doctest: +SKIP array([[4, 1], [2, 2]])
>>> a = np.arange(3*4*5*6).reshape((3,4,5,6)) # doctest: +SKIP >>> b = np.arange(3*4*5*6)[::1].reshape((5,4,6,3)) # doctest: +SKIP >>> np.dot(a, b)[2,3,2,1,2,2] # doctest: +SKIP 499128 >>> sum(a[2,3,2,:] * b[1,2,:,2]) # doctest: +SKIP 499128

dask.array.
dstack
(tup, allow_unknown_chunksizes=False)¶ Stack arrays in sequence depth wise (along third axis).
This docstring was copied from numpy.dstack.
Some inconsistencies with the Dask version may exist.
This is equivalent to concatenation along the third axis after 2D arrays of shape (M,N) have been reshaped to (M,N,1) and 1D arrays of shape (N,) have been reshaped to (1,N,1). Rebuilds arrays divided by dsplit.
This function makes most sense for arrays with up to 3 dimensions. For instance, for pixeldata with a height (first axis), width (second axis), and r/g/b channels (third axis). The functions concatenate, stack and block provide more general stacking and concatenation operations.
Parameters:  tup : sequence of arrays
The arrays must have the same shape along all but the third axis. 1D or 2D arrays must have the same shape.
Returns:  stacked : ndarray
The array formed by stacking the given arrays, will be at least 3D.
See also
stack
 Join a sequence of arrays along a new axis.
vstack
 Stack along first axis.
hstack
 Stack along second axis.
concatenate
 Join a sequence of arrays along an existing axis.
dsplit
 Split array along third axis.
Examples
>>> a = np.array((1,2,3)) # doctest: +SKIP >>> b = np.array((2,3,4)) # doctest: +SKIP >>> np.dstack((a,b)) # doctest: +SKIP array([[[1, 2], [2, 3], [3, 4]]])
>>> a = np.array([[1],[2],[3]]) # doctest: +SKIP >>> b = np.array([[2],[3],[4]]) # doctest: +SKIP >>> np.dstack((a,b)) # doctest: +SKIP array([[[1, 2]], [[2, 3]], [[3, 4]]])

dask.array.
ediff1d
(ary, to_end=None, to_begin=None)¶ The differences between consecutive elements of an array.
This docstring was copied from numpy.ediff1d.
Some inconsistencies with the Dask version may exist.
Parameters:  ary : array_like
If necessary, will be flattened before the differences are taken.
 to_end : array_like, optional
Number(s) to append at the end of the returned differences.
 to_begin : array_like, optional
Number(s) to prepend at the beginning of the returned differences.
Returns:  ediff1d : ndarray
The differences. Loosely, this is
ary.flat[1:]  ary.flat[:1]
.
Notes
When applied to masked arrays, this function drops the mask information if the to_begin and/or to_end parameters are used.
Examples
>>> x = np.array([1, 2, 4, 7, 0]) # doctest: +SKIP >>> np.ediff1d(x) # doctest: +SKIP array([ 1, 2, 3, 7])
>>> np.ediff1d(x, to_begin=99, to_end=np.array([88, 99])) # doctest: +SKIP array([99, 1, 2, 3, 7, 88, 99])
The returned array is always 1D.
>>> y = [[1, 2, 4], [1, 6, 24]] # doctest: +SKIP >>> np.ediff1d(y) # doctest: +SKIP array([ 1, 2, 3, 5, 18])

dask.array.
empty
(*args, **kwargs)¶ Blocked variant of empty
Follows the signature of empty exactly except that it also requires a keyword argument chunks=(…)
Original signature follows below. empty(shape, dtype=float, order=’C’)
Return a new array of given shape and type, without initializing entries.
Parameters:  shape : int or tuple of int
Shape of the empty array, e.g.,
(2, 3)
or2
. dtype : datatype, optional
Desired output datatype for the array, e.g, numpy.int8. Default is numpy.float64.
 order : {‘C’, ‘F’}, optional, default: ‘C’
Whether to store multidimensional data in rowmajor (Cstyle) or columnmajor (Fortranstyle) order in memory.
Returns:  out : ndarray
Array of uninitialized (arbitrary) data of the given shape, dtype, and order. Object arrays will be initialized to None.
See also
empty_like
 Return an empty array with shape and type of input.
ones
 Return a new array setting values to one.
zeros
 Return a new array setting values to zero.
full
 Return a new array of given shape filled with value.
Notes
empty, unlike zeros, does not set the array values to zero, and may therefore be marginally faster. On the other hand, it requires the user to manually set all the values in the array, and should be used with caution.
Examples
>>> np.empty([2, 2]) array([[ 9.74499359e+001, 6.69583040e309], [ 2.13182611e314, 3.06959433e309]]) #random
>>> np.empty([2, 2], dtype=int) array([[1073741821, 1067949133], [ 496041986, 19249760]]) #random

dask.array.
empty_like
(a, dtype=None, chunks=None)¶ Return a new array with the same shape and type as a given array.
Parameters:  a : array_like
The shape and datatype of a define these same attributes of the returned array.
 dtype : datatype, optional
Overrides the data type of the result.
 chunks : sequence of ints
The number of samples on each block. Note that the last block will have fewer samples if
len(array) % chunks != 0
.
Returns:  out : ndarray
Array of uninitialized (arbitrary) data with the same shape and type as a.
See also
ones_like
 Return an array of ones with shape and type of input.
zeros_like
 Return an array of zeros with shape and type of input.
empty
 Return a new uninitialized array.
ones
 Return a new array setting values to one.
zeros
 Return a new array setting values to zero.
Notes
This function does not initialize the returned array; to do that use zeros_like or ones_like instead. It may be marginally faster than the functions that do set the array values.

dask.array.
einsum
(subscripts, *operands, out=None, dtype=None, order='K', casting='safe', optimize=False)¶ This docstring was copied from numpy.einsum.
Some inconsistencies with the Dask version may exist.
Evaluates the Einstein summation convention on the operands.
Using the Einstein summation convention, many common multidimensional, linear algebraic array operations can be represented in a simple fashion. In implicit mode einsum computes these values.
In explicit mode, einsum provides further flexibility to compute other array operations that might not be considered classical Einstein summation operations, by disabling, or forcing summation over specified subscript labels.
See the notes and examples for clarification.
Parameters:  subscripts : str
Specifies the subscripts for summation as comma separated list of subscript labels. An implicit (classical Einstein summation) calculation is performed unless the explicit indicator ‘>’ is included as well as subscript labels of the precise output form.
 operands : list of array_like
These are the arrays for the operation.
 out : ndarray, optional
If provided, the calculation is done into this array.
 dtype : {datatype, None}, optional
If provided, forces the calculation to use the data type specified. Note that you may have to also give a more liberal casting parameter to allow the conversions. Default is None.
 order : {‘C’, ‘F’, ‘A’, ‘K’}, optional
Controls the memory layout of the output. ‘C’ means it should be C contiguous. ‘F’ means it should be Fortran contiguous, ‘A’ means it should be ‘F’ if the inputs are all ‘F’, ‘C’ otherwise. ‘K’ means it should be as close to the layout as the inputs as is possible, including arbitrarily permuted axes. Default is ‘K’.
 casting : {‘no’, ‘equiv’, ‘safe’, ‘same_kind’, ‘unsafe’}, optional
Controls what kind of data casting may occur. Setting this to ‘unsafe’ is not recommended, as it can adversely affect accumulations.
 ‘no’ means the data types should not be cast at all.
 ‘equiv’ means only byteorder changes are allowed.
 ‘safe’ means only casts which can preserve values are allowed.
 ‘same_kind’ means only safe casts or casts within a kind, like float64 to float32, are allowed.
 ‘unsafe’ means any data conversions may be done.
Default is ‘safe’.
 optimize : {False, True, ‘greedy’, ‘optimal’}, optional
Controls if intermediate optimization should occur. No optimization will occur if False and True will default to the ‘greedy’ algorithm. Also accepts an explicit contraction list from the
np.einsum_path
function. Seenp.einsum_path
for more details. Defaults to False.
Returns:  output : ndarray
The calculation based on the Einstein summation convention.
Notes
New in version 1.6.0.
The Einstein summation convention can be used to compute many multidimensional, linear algebraic array operations. einsum provides a succinct way of representing these.
A nonexhaustive list of these operations, which can be computed by einsum, is shown below along with examples:
 Trace of an array,
numpy.trace()
.  Return a diagonal,
numpy.diag()
.  Array axis summations,
numpy.sum()
.  Transpositions and permutations,
numpy.transpose()
.  Matrix multiplication and dot product,
numpy.matmul()
numpy.dot()
.  Vector inner and outer products,
numpy.inner()
numpy.outer()
.  Broadcasting, elementwise and scalar multiplication,
numpy.multiply()
.  Tensor contractions,
numpy.tensordot()
.  Chained array operations, in efficient calculation order,
numpy.einsum_path()
.
The subscripts string is a commaseparated list of subscript labels, where each label refers to a dimension of the corresponding operand. Whenever a label is repeated it is summed, so
np.einsum('i,i', a, b)
is equivalent tonp.inner(a,b)
. If a label appears only once, it is not summed, sonp.einsum('i', a)
produces a view ofa
with no changes. A further examplenp.einsum('ij,jk', a, b)
describes traditional matrix multiplication and is equivalent tonp.matmul(a,b)
. Repeated subscript labels in one operand take the diagonal. For example,np.einsum('ii', a)
is equivalent tonp.trace(a)
.In implicit mode, the chosen subscripts are important since the axes of the output are reordered alphabetically. This means that
np.einsum('ij', a)
doesn’t affect a 2D array, whilenp.einsum('ji', a)
takes its transpose. Additionally,np.einsum('ij,jk', a, b)
returns a matrix multiplication, while,np.einsum('ij,jh', a, b)
returns the transpose of the multiplication since subscript ‘h’ precedes subscript ‘i’.In explicit mode the output can be directly controlled by specifying output subscript labels. This requires the identifier ‘>’ as well as the list of output subscript labels. This feature increases the flexibility of the function since summing can be disabled or forced when required. The call
np.einsum('i>', a)
is likenp.sum(a, axis=1)
, andnp.einsum('ii>i', a)
is likenp.diag(a)
. The difference is that einsum does not allow broadcasting by default. Additionallynp.einsum('ij,jh>ih', a, b)
directly specifies the order of the output subscript labels and therefore returns matrix multiplication, unlike the example above in implicit mode.To enable and control broadcasting, use an ellipsis. Default NumPystyle broadcasting is done by adding an ellipsis to the left of each term, like
np.einsum('...ii>...i', a)
. To take the trace along the first and last axes, you can donp.einsum('i...i', a)
, or to do a matrixmatrix product with the leftmost indices instead of rightmost, one can donp.einsum('ij...,jk...>ik...', a, b)
.When there is only one operand, no axes are summed, and no output parameter is provided, a view into the operand is returned instead of a new array. Thus, taking the diagonal as
np.einsum('ii>i', a)
produces a view (changed in version 1.10.0).einsum also provides an alternative way to provide the subscripts and operands as
einsum(op0, sublist0, op1, sublist1, ..., [sublistout])
. If the output shape is not provided in this format einsum will be calculated in implicit mode, otherwise it will be performed explicitly. The examples below have corresponding einsum calls with the two parameter methods.New in version 1.10.0.
Views returned from einsum are now writeable whenever the input array is writeable. For example,
np.einsum('ijk...>kji...', a)
will now have the same effect asnp.swapaxes(a, 0, 2)
andnp.einsum('ii>i', a)
will return a writeable view of the diagonal of a 2D array.New in version 1.12.0.
Added the
optimize
argument which will optimize the contraction order of an einsum expression. For a contraction with three or more operands this can greatly increase the computational efficiency at the cost of a larger memory footprint during computation.Typically a ‘greedy’ algorithm is applied which empirical tests have shown returns the optimal path in the majority of cases. In some cases ‘optimal’ will return the superlative path through a more expensive, exhaustive search. For iterative calculations it may be advisable to calculate the optimal path once and reuse that path by supplying it as an argument. An example is given below.
See
numpy.einsum_path()
for more details.Examples
>>> a = np.arange(25).reshape(5,5) # doctest: +SKIP >>> b = np.arange(5) # doctest: +SKIP >>> c = np.arange(6).reshape(2,3) # doctest: +SKIP
Trace of a matrix:
>>> np.einsum('ii', a) # doctest: +SKIP 60 >>> np.einsum(a, [0,0]) # doctest: +SKIP 60 >>> np.trace(a) # doctest: +SKIP 60
Extract the diagonal (requires explicit form):
>>> np.einsum('ii>i', a) # doctest: +SKIP array([ 0, 6, 12, 18, 24]) >>> np.einsum(a, [0,0], [0]) # doctest: +SKIP array([ 0, 6, 12, 18, 24]) >>> np.diag(a) # doctest: +SKIP array([ 0, 6, 12, 18, 24])
Sum over an axis (requires explicit form):
>>> np.einsum('ij>i', a) # doctest: +SKIP array([ 10, 35, 60, 85, 110]) >>> np.einsum(a, [0,1], [0]) # doctest: +SKIP array([ 10, 35, 60, 85, 110]) >>> np.sum(a, axis=1) # doctest: +SKIP array([ 10, 35, 60, 85, 110])
For higher dimensional arrays summing a single axis can be done with ellipsis:
>>> np.einsum('...j>...', a) # doctest: +SKIP array([ 10, 35, 60, 85, 110]) >>> np.einsum(a, [Ellipsis,1], [Ellipsis]) # doctest: +SKIP array([ 10, 35, 60, 85, 110])
Compute a matrix transpose, or reorder any number of axes:
>>> np.einsum('ji', c) # doctest: +SKIP array([[0, 3], [1, 4], [2, 5]]) >>> np.einsum('ij>ji', c) # doctest: +SKIP array([[0, 3], [1, 4], [2, 5]]) >>> np.einsum(c, [1,0]) # doctest: +SKIP array([[0, 3], [1, 4], [2, 5]]) >>> np.transpose(c) # doctest: +SKIP array([[0, 3], [1, 4], [2, 5]])
Vector inner products:
>>> np.einsum('i,i', b, b) # doctest: +SKIP 30 >>> np.einsum(b, [0], b, [0]) # doctest: +SKIP 30 >>> np.inner(b,b) # doctest: +SKIP 30
Matrix vector multiplication:
>>> np.einsum('ij,j', a, b) # doctest: +SKIP array([ 30, 80, 130, 180, 230]) >>> np.einsum(a, [0,1], b, [1]) # doctest: +SKIP array([ 30, 80, 130, 180, 230]) >>> np.dot(a, b) # doctest: +SKIP array([ 30, 80, 130, 180, 230]) >>> np.einsum('...j,j', a, b) # doctest: +SKIP array([ 30, 80, 130, 180, 230])
Broadcasting and scalar multiplication:
>>> np.einsum('..., ...', 3, c) # doctest: +SKIP array([[ 0, 3, 6], [ 9, 12, 15]]) >>> np.einsum(',ij', 3, c) # doctest: +SKIP array([[ 0, 3, 6], [ 9, 12, 15]]) >>> np.einsum(3, [Ellipsis], c, [Ellipsis]) # doctest: +SKIP array([[ 0, 3, 6], [ 9, 12, 15]]) >>> np.multiply(3, c) # doctest: +SKIP array([[ 0, 3, 6], [ 9, 12, 15]])
Vector outer product:
>>> np.einsum('i,j', np.arange(2)+1, b) # doctest: +SKIP array([[0, 1, 2, 3, 4], [0, 2, 4, 6, 8]]) >>> np.einsum(np.arange(2)+1, [0], b, [1]) # doctest: +SKIP array([[0, 1, 2, 3, 4], [0, 2, 4, 6, 8]]) >>> np.outer(np.arange(2)+1, b) # doctest: +SKIP array([[0, 1, 2, 3, 4], [0, 2, 4, 6, 8]])
Tensor contraction:
>>> a = np.arange(60.).reshape(3,4,5) # doctest: +SKIP >>> b = np.arange(24.).reshape(4,3,2) # doctest: +SKIP >>> np.einsum('ijk,jil>kl', a, b) # doctest: +SKIP array([[ 4400., 4730.], [ 4532., 4874.], [ 4664., 5018.], [ 4796., 5162.], [ 4928., 5306.]]) >>> np.einsum(a, [0,1,2], b, [1,0,3], [2,3]) # doctest: +SKIP array([[ 4400., 4730.], [ 4532., 4874.], [ 4664., 5018.], [ 4796., 5162.], [ 4928., 5306.]]) >>> np.tensordot(a,b, axes=([1,0],[0,1])) # doctest: +SKIP array([[ 4400., 4730.], [ 4532., 4874.], [ 4664., 5018.], [ 4796., 5162.], [ 4928., 5306.]])
Writeable returned arrays (since version 1.10.0):
>>> a = np.zeros((3, 3)) # doctest: +SKIP >>> np.einsum('ii>i', a)[:] = 1 # doctest: +SKIP >>> a # doctest: +SKIP array([[ 1., 0., 0.], [ 0., 1., 0.], [ 0., 0., 1.]])
Example of ellipsis use:
>>> a = np.arange(6).reshape((3,2)) # doctest: +SKIP >>> b = np.arange(12).reshape((4,3)) # doctest: +SKIP >>> np.einsum('ki,jk>ij', a, b) # doctest: +SKIP array([[10, 28, 46, 64], [13, 40, 67, 94]]) >>> np.einsum('ki,...k>i...', a, b) # doctest: +SKIP array([[10, 28, 46, 64], [13, 40, 67, 94]]) >>> np.einsum('k...,jk', a, b) # doctest: +SKIP array([[10, 28, 46, 64], [13, 40, 67, 94]])
Chained array operations. For more complicated contractions, speed ups might be achieved by repeatedly computing a ‘greedy’ path or precomputing the ‘optimal’ path and repeatedly applying it, using an einsum_path insertion (since version 1.12.0). Performance improvements can be particularly significant with larger arrays:
>>> a = np.ones(64).reshape(2,4,8) # doctest: +SKIP # Basic `einsum`: ~1520ms (benchmarked on 3.1GHz Intel i5.) >>> for iteration in range(500): # doctest: +SKIP ... np.einsum('ijk,ilm,njm,nlk,abc>',a,a,a,a,a) # Suboptimal `einsum` (due to repeated path calculation time): ~330ms >>> for iteration in range(500): # doctest: +SKIP ... np.einsum('ijk,ilm,njm,nlk,abc>',a,a,a,a,a, optimize='optimal') # Greedy `einsum` (faster optimal path approximation): ~160ms >>> for iteration in range(500): # doctest: +SKIP ... np.einsum('ijk,ilm,njm,nlk,abc>',a,a,a,a,a, optimize='greedy') # Optimal `einsum` (best usage pattern in some use cases): ~110ms >>> path = np.einsum_path('ijk,ilm,njm,nlk,abc>',a,a,a,a,a, optimize='optimal')[0] # doctest: +SKIP >>> for iteration in range(500): # doctest: +SKIP ... np.einsum('ijk,ilm,njm,nlk,abc>',a,a,a,a,a, optimize=path)

dask.array.
exp
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Calculate the exponential of all elements in the input array.
Parameters:  x : array_like
Input values.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Output array, elementwise exponential of x. This is a scalar if x is a scalar.
See also
expm1
 Calculate
exp(x)  1
for all elements in the array. exp2
 Calculate
2**x
for all elements in the array.
Notes
The irrational number
e
is also known as Euler’s number. It is approximately 2.718281, and is the base of the natural logarithm,ln
(this means that, if \(x = \ln y = \log_e y\), then \(e^x = y\). For real input,exp(x)
is always positive.For complex arguments,
x = a + ib
, we can write \(e^x = e^a e^{ib}\). The first term, \(e^a\), is already known (it is the real argument, described above). The second term, \(e^{ib}\), is \(\cos b + i \sin b\), a function with magnitude 1 and a periodic phase.References
[1] Wikipedia, “Exponential function”, https://en.wikipedia.org/wiki/Exponential_function [2] M. Abramovitz and I. A. Stegun, “Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables,” Dover, 1964, p. 69, http://www.math.sfu.ca/~cbm/aands/page_69.htm Examples
Plot the magnitude and phase of
exp(x)
in the complex plane:>>> import matplotlib.pyplot as plt # doctest: +SKIP
>>> x = np.linspace(2*np.pi, 2*np.pi, 100) # doctest: +SKIP >>> xx = x + 1j * x[:, np.newaxis] # a + ib over complex plane # doctest: +SKIP >>> out = np.exp(xx) # doctest: +SKIP
>>> plt.subplot(121) # doctest: +SKIP >>> plt.imshow(np.abs(out), # doctest: +SKIP ... extent=[2*np.pi, 2*np.pi, 2*np.pi, 2*np.pi], cmap='gray') >>> plt.title('Magnitude of exp(x)') # doctest: +SKIP
>>> plt.subplot(122) # doctest: +SKIP >>> plt.imshow(np.angle(out), # doctest: +SKIP ... extent=[2*np.pi, 2*np.pi, 2*np.pi, 2*np.pi], cmap='hsv') >>> plt.title('Phase (angle) of exp(x)') # doctest: +SKIP >>> plt.show() # doctest: +SKIP

dask.array.
expm1
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Calculate
exp(x)  1
for all elements in the array.Parameters:  x : array_like
Input values.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  out : ndarray or scalar
Elementwise exponential minus one:
out = exp(x)  1
. This is a scalar if x is a scalar.
See also
log1p
log(1 + x)
, the inverse of expm1.
Notes
This function provides greater precision than
exp(x)  1
for small values ofx
.Examples
The true value of
exp(1e10)  1
is1.00000000005e10
to about 32 significant digits. This example shows the superiority of expm1 in this case.>>> np.expm1(1e10) # doctest: +SKIP 1.00000000005e10 >>> np.exp(1e10)  1 # doctest: +SKIP 1.000000082740371e10

dask.array.
eye
(N, chunks='auto', M=None, k=0, dtype=<class 'float'>)¶ Return a 2D Array with ones on the diagonal and zeros elsewhere.
Parameters:  N : int
Number of rows in the output.
 chunks : int, str
How to chunk the array. Must be one of the following forms:
 A blocksize like 1000.
 A size in bytes, like “100 MiB” which will choose a uniform blocklike shape
 The word “auto” which acts like the above, but uses a configuration
value
array.chunksize
for the chunk size
 M : int, optional
Number of columns in the output. If None, defaults to N.
 k : int, optional
Index of the diagonal: 0 (the default) refers to the main diagonal, a positive value refers to an upper diagonal, and a negative value to a lower diagonal.
 dtype : datatype, optional
Datatype of the returned array.
Returns:  I : Array of shape (N,M)
An array where all elements are equal to zero, except for the kth diagonal, whose values are equal to one.

dask.array.
fabs
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Compute the absolute values elementwise.
This function returns the absolute values (positive magnitude) of the data in x. Complex values are not handled, use absolute to find the absolute values of complex data.
Parameters:  x : array_like
The array of numbers for which the absolute values are required. If x is a scalar, the result y will also be a scalar.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray or scalar
The absolute values of x, the returned values are always floats. This is a scalar if x is a scalar.
See also
absolute
 Absolute values including complex types.
Examples
>>> np.fabs(1) # doctest: +SKIP 1.0 >>> np.fabs([1.2, 1.2]) # doctest: +SKIP array([ 1.2, 1.2])

dask.array.
fix
(*args, **kwargs)¶ Round to nearest integer towards zero.
Round an array of floats elementwise to nearest integer towards zero. The rounded values are returned as floats.
Parameters:  x : array_like
An array of floats to be rounded
 y : ndarray, optional
Output array
Returns:  out : ndarray of floats
The array of rounded numbers
Examples
>>> np.fix(3.14) # doctest: +SKIP 3.0 >>> np.fix(3) # doctest: +SKIP 3.0 >>> np.fix([2.1, 2.9, 2.1, 2.9]) # doctest: +SKIP array([ 2., 2., 2., 2.])

dask.array.
flatnonzero
(a)¶ Return indices that are nonzero in the flattened version of a.
This docstring was copied from numpy.flatnonzero.
Some inconsistencies with the Dask version may exist.
This is equivalent to np.nonzero(np.ravel(a))[0].
Parameters:  a : array_like
Input data.
Returns:  res : ndarray
Output array, containing the indices of the elements of a.ravel() that are nonzero.
See also
Examples
>>> x = np.arange(2, 3) # doctest: +SKIP >>> x # doctest: +SKIP array([2, 1, 0, 1, 2]) >>> np.flatnonzero(x) # doctest: +SKIP array([0, 1, 3, 4])
Use the indices of the nonzero elements as an index array to extract these elements:
>>> x.ravel()[np.flatnonzero(x)] # doctest: +SKIP array([2, 1, 1, 2])

dask.array.
flip
(m, axis)¶ Reverse element order along axis.
Parameters:  axis : int
Axis to reverse element order of.
Returns:  reversed array : ndarray

dask.array.
flipud
(m)¶ Flip array in the up/down direction.
This docstring was copied from numpy.flipud.
Some inconsistencies with the Dask version may exist.
Flip the entries in each column in the up/down direction. Rows are preserved, but appear in a different order than before.
Parameters:  m : array_like
Input array.
Returns:  out : array_like
A view of m with the rows reversed. Since a view is returned, this operation is \(\mathcal O(1)\).
See also
fliplr
 Flip array in the left/right direction.
rot90
 Rotate array counterclockwise.
Notes
Equivalent to
m[::1,...]
. Does not require the array to be twodimensional.Examples
>>> A = np.diag([1.0, 2, 3]) # doctest: +SKIP >>> A # doctest: +SKIP array([[ 1., 0., 0.], [ 0., 2., 0.], [ 0., 0., 3.]]) >>> np.flipud(A) # doctest: +SKIP array([[ 0., 0., 3.], [ 0., 2., 0.], [ 1., 0., 0.]])
>>> A = np.random.randn(2,3,5) # doctest: +SKIP >>> np.all(np.flipud(A) == A[::1,...]) # doctest: +SKIP True
>>> np.flipud([1,2]) # doctest: +SKIP array([2, 1])

dask.array.
fliplr
(m)¶ Flip array in the left/right direction.
This docstring was copied from numpy.fliplr.
Some inconsistencies with the Dask version may exist.
Flip the entries in each row in the left/right direction. Columns are preserved, but appear in a different order than before.
Parameters:  m : array_like
Input array, must be at least 2D.
Returns:  f : ndarray
A view of m with the columns reversed. Since a view is returned, this operation is \(\mathcal O(1)\).
See also
flipud
 Flip array in the up/down direction.
rot90
 Rotate array counterclockwise.
Notes
Equivalent to m[:,::1]. Requires the array to be at least 2D.
Examples
>>> A = np.diag([1.,2.,3.]) # doctest: +SKIP >>> A # doctest: +SKIP array([[ 1., 0., 0.], [ 0., 2., 0.], [ 0., 0., 3.]]) >>> np.fliplr(A) # doctest: +SKIP array([[ 0., 0., 1.], [ 0., 2., 0.], [ 3., 0., 0.]])
>>> A = np.random.randn(2,3,5) # doctest: +SKIP >>> np.all(np.fliplr(A) == A[:,::1,...]) # doctest: +SKIP True

dask.array.
floor
(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Return the floor of the input, elementwise.
The floor of the scalar x is the largest integer i, such that i <= x. It is often denoted as \(\lfloor x \rfloor\).
Parameters:  x : array_like
Input data.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray or scalar
The floor of each element in x. This is a scalar if x is a scalar.
Notes
Some spreadsheet programs calculate the “floortowardszero”, in other words
floor(2.5) == 2
. NumPy instead uses the definition of floor where floor(2.5) == 3.Examples
>>> a = np.array([1.7, 1.5, 0.2, 0.2, 1.5, 1.7, 2.0]) # doctest: +SKIP >>> np.floor(a) # doctest: +SKIP array([2., 2., 1., 0., 1., 1., 2.])

dask.array.
fmax
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Elementwise maximum of array elements.
Compare two arrays and returns a new array containing the elementwise maxima. If one of the elements being compared is a NaN, then the nonnan element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are ignored when possible.
Parameters:  x1, x2 : array_like
The arrays holding the elements to be compared. They must have the same shape.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray or scalar
The maximum of x1 and x2, elementwise. This is a scalar if both x1 and x2 are scalars.
See also
Notes
New in version 1.3.0.
The fmax is equivalent to
np.where(x1 >= x2, x1, x2)
when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting.Examples
>>> np.fmax([2, 3, 4], [1, 5, 2]) # doctest: +SKIP array([ 2., 5., 4.])
>>> np.fmax(np.eye(2), [0.5, 2]) # doctest: +SKIP array([[ 1. , 2. ], [ 0.5, 2. ]])
>>> np.fmax([np.nan, 0, np.nan],[0, np.nan, np.nan]) # doctest: +SKIP array([ 0., 0., NaN])

dask.array.
fmin
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Elementwise minimum of array elements.
Compare two arrays and returns a new array containing the elementwise minima. If one of the elements being compared is a NaN, then the nonnan element is returned. If both elements are NaNs then the first is returned. The latter distinction is important for complex NaNs, which are defined as at least one of the real or imaginary parts being a NaN. The net effect is that NaNs are ignored when possible.
Parameters:  x1, x2 : array_like
The arrays holding the elements to be compared. They must have the same shape.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : ndarray or scalar
The minimum of x1 and x2, elementwise. This is a scalar if both x1 and x2 are scalars.
See also
Notes
New in version 1.3.0.
The fmin is equivalent to
np.where(x1 <= x2, x1, x2)
when neither x1 nor x2 are NaNs, but it is faster and does proper broadcasting.Examples
>>> np.fmin([2, 3, 4], [1, 5, 2]) # doctest: +SKIP array([1, 3, 2])
>>> np.fmin(np.eye(2), [0.5, 2]) # doctest: +SKIP array([[ 0.5, 0. ], [ 0. , 1. ]])
>>> np.fmin([np.nan, 0, np.nan],[0, np.nan, np.nan]) # doctest: +SKIP array([ 0., 0., NaN])

dask.array.
fmod
(x1, x2, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Return the elementwise remainder of division.
This is the NumPy implementation of the C library function fmod, the remainder has the same sign as the dividend x1. It is equivalent to the Matlab(TM)
rem
function and should not be confused with the Python modulus operatorx1 % x2
.Parameters:  x1 : array_like
Dividend.
 x2 : array_like
Divisor.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  y : array_like
The remainder of the division of x1 by x2. This is a scalar if both x1 and x2 are scalars.
See also
remainder
 Equivalent to the Python
%
operator.
divide
Notes
The result of the modulo operation for negative dividend and divisors is bound by conventions. For fmod, the sign of result is the sign of the dividend, while for remainder the sign of the result is the sign of the divisor. The fmod function is equivalent to the Matlab(TM)
rem
function.Examples
>>> np.fmod([3, 2, 1, 1, 2, 3], 2) # doctest: +SKIP array([1, 0, 1, 1, 0, 1]) >>> np.remainder([3, 2, 1, 1, 2, 3], 2) # doctest: +SKIP array([1, 0, 1, 1, 0, 1])
>>> np.fmod([5, 3], [2, 2.]) # doctest: +SKIP array([ 1., 1.]) >>> a = np.arange(3, 3).reshape(3, 2) # doctest: +SKIP >>> a # doctest: +SKIP array([[3, 2], [1, 0], [ 1, 2]]) >>> np.fmod(a, [2,2]) # doctest: +SKIP array([[1, 0], [1, 0], [ 1, 0]])

dask.array.
frexp
(x, [out1, out2, ]/, [out=(None, None), ]*, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature, extobj])¶ Decompose the elements of x into mantissa and twos exponent.
Returns (mantissa, exponent), where x = mantissa * 2**exponent`. The mantissa is lies in the open interval(1, 1), while the twos exponent is a signed integer.
Parameters:  x : array_like
Array of numbers to be decomposed.
 out1 : ndarray, optional
Output array for the mantissa. Must have the same shape as x.
 out2 : ndarray, optional
Output array for the exponent. Must have the same shape as x.
 out : ndarray, None, or tuple of ndarray and None, optional
A location into which the result is stored. If provided, it must have a shape that the inputs broadcast to. If not provided or None, a freshlyallocated array is returned. A tuple (possible only as a keyword argument) must have length equal to the number of outputs.
 where : array_like, optional
Values of True indicate to calculate the ufunc at that position, values of False indicate to leave the value in the output alone.
 **kwargs
For other keywordonly arguments, see the ufunc docs.
Returns:  mantissa : ndarray
Floating values between 1 and 1. This is a scalar if x is a scalar.
 exponent : ndarray
Integer exponents of 2. This is a scalar if x is a scalar.
See also
ldexp
 Compute
y = x1 * 2**x2
, the inverse of frexp.
Notes
Complex dtypes are not supported, they will raise a TypeError.
Examples
>>> x = np.arange(9) # doctest: +SKIP >>> y1, y2 = np.frexp(x) # doctest: +SKIP >>> y1 # doctest: +SKIP array([ 0. , 0.5 , 0.5 , 0.75 , 0.5 , 0.625, 0.75 , 0.875, 0.5 ]) >>> y2 # doctest: +SKIP array([0, 1, 2, 2, 3, 3, 3, 3, 4]) >>> y1 * 2**y2 # doctest: +SKIP array([ 0., 1., 2., 3., 4., 5., 6., 7., 8.])

dask.array.
fromfunction
(func, chunks='auto', shape=None, dtype=None, **kwargs)¶ Construct an array by executing a function over each coordinate.
This docstring was copied from numpy.fromfunction.
Some inconsistencies with the Dask version may exist.
The resulting array therefore has a value
fn(x, y, z)
at coordinate(x, y, z)
.Parameters:  function : callable (Not supported in Dask)
The function is called with N parameters, where N is the rank of shape. Each parameter represents the coordinates of the array varying along a specific axis. For example, if shape were
(2, 2)
, then the parameters would bearray([[0, 0], [1, 1]])
andarray([[0, 1], [0, 1]])
 shape : (N,) tuple of ints
Shape of the output array, which also determines the shape of the coordinate arrays passed to function.
 dtype : datatype, optional
Datatype of the coordinate arrays passed to function. By default, dtype is float.
Returns:  fromfunction : any
The result of the call to function is passed back directly. Therefore the shape of fromfunction is completely determined by function. If function returns a scalar value, the shape of fromfunction would not match the shape parameter.
Notes
Keywords other than dtype are passed to function.
Examples
>>> np.fromfunction(lambda i, j: i == j, (3, 3), dtype=int) # doctest: +SKIP array([[ True, False, False], [False, True, False], [False, False, True]])
>>> np.fromfunction(lambda i, j: i + j, (3, 3), dtype=int) # doctest: +SKIP array([[0, 1, 2], [1, 2, 3], [2, 3, 4]])

dask.array.
frompyfunc
(func, nin, nout)¶ This docstring was copied from numpy.frompyfunc.
Some inconsistencies with the Dask version may exist.
Takes an arbitrary Python function and returns a NumPy ufunc.
Can be used, for example, to add broadcasting to a builtin Python function (see Examples section).
Parameters:  func : Python function object
An arbitrary Python function.
 nin : int
The number of input arguments.
 nout : int
The number of objects returned by func.
Returns:  out : ufunc
Returns a NumPy universal function (
ufunc
) object.
See also
vectorize
 evaluates pyfunc over input arrays using broadcasting rules of numpy
Notes
The returned ufunc always returns PyObject arrays.
Examples
Use frompyfunc to add broadcasting to the Python function
oct
:>>> oct_array = np.frompyfunc(oct, 1, 1) # doctest: +SKIP >>> oct_array(np.array((10, 30, 100))) # doctest: +SKIP array([012, 036, 0144], dtype=object) >>> np.array((oct(10), oct(30), oct(100))) # for comparison # doctest: +SKIP array(['012', '036', '0144'], dtype='S4')

dask.array.
full
(*args, **kwargs)¶ Blocked variant of full
Follows the signature of full exactly except that it also requires a keyword argument chunks=(…)
Original signature follows below.
Return a new array of given shape and type, filled with fill_value.
Parameters:  shape : int or sequence of ints
Shape of the new array, e.g.,
(2, 3)
or2
. fill_value : scalar
Fill value.
 dtype : datatype, optional
 The desired datatype for the array The default, None, means
np.array(fill_value).dtype.
 order : {‘C’, ‘F’}, optional
Whether to store multidimensional data in C or Fortrancontiguous (row or columnwise) order in memory.
Returns:  out : ndarray
Array of fill_value with the given shape, dtype, and order.
See also
Examples
>>> np.full((2, 2), np.inf) array([[ inf, inf], [ inf, inf]]) >>> np.full((2, 2), 10) array([[10, 10], [10, 10]])

dask.array.
full_like
(a, fill_value, dtype=None, chunks=None)¶ Return a full array with the same shape and type as a given array.
Parameters:  a : array_like
The shape and datatype of a define these same attributes of the returned array.
 fill_value : scalar
Fill value.
 dtype : datatype, optional
Overrides the data type of the result.
 chunks : sequence of ints
The number of samples on each block. Note that the last block will have fewer samples if
len(array) % chunks != 0
.
Returns:  out : ndarray
Array of fill_value with the same shape and type as a.
See also
zeros_like
 Return an array of zeros with shape and type of input.
ones_like
 Return an array of ones with shape and type of input.
empty_like
 Return an empty array with shape and type of input.
zeros
 Return a new array setting values to zero.
ones
 Return a new array setting values to one.
empty
 Return a new uninitialized array.
full
 Fill a new array.

dask.array.
gradient
(f, *varargs, **kwargs)¶ Return the gradient of an Ndimensional array.
This docstring was copied from numpy.gradient.
Some inconsistencies with the Dask version may exist.
The gradient is computed using second order accurate central differences in the interior points and either first or second order accurate onesides (forward or backwards) differences at the boundaries. The returned gradient hence has the same shape as the input array.
Parameters:  f : array_like
An Ndimensional array containing samples of a scalar function.
 varargs : list of scalar or array, optional
Spacing between f values. Default unitary spacing for all dimensions. Spacing can be specified using:
 single scalar to specify a sample distance for all dimensions.
 N scalars to specify a constant sample distance for each dimension. i.e. dx, dy, dz, …
 N arrays to specify the coordinates of the values along each dimension of F. The length of the array must match the size of the corresponding dimension
 Any combination of N scalars/arrays with the meaning of 2. and 3.
If axis is given, the number of varargs must equal the number of axes. Default: 1.
 edge_order : {1, 2}, optional
Gradient is calculated using Nth order accurate differences at the boundaries. Default: 1.
New in version 1.9.1.
 axis : None or int or tuple of ints, optional
Gradient is calculated only along the given axis or axes The default (axis = None) is to calculate the gradient for all the axes of the input array. axis may be negative, in which case it counts from the last to the first axis.
New in version 1.11.0.
Returns:  gradient : ndarray or list of ndarray
A set of ndarrays (or a single ndarray if there is only one dimension) corresponding to the derivatives of f with respect to each dimension. Each derivative has the same shape as f.
Notes
Assuming that \(f\in C^{3}\) (i.e., \(f\) has at least 3 continuous derivatives) and let \(h_{*}\) be a nonhomogeneous stepsize, we minimize the “consistency error” \(\eta_{i}\) between the true gradient and its estimate from a linear combination of the neighboring gridpoints:
\[\eta_{i} = f_{i}^{\left(1\right)}  \left[ \alpha f\left(x_{i}\right) + \beta f\left(x_{i} + h_{d}\right) + \gamma f\left(x_{i}h_{s}\right) \right]\]By substituting \(f(x_{i} + h_{d})\) and \(f(x_{i}  h_{s})\) with their Taylor series expansion, this translates into solving the following the linear system:
\[\begin{split}\left\{ \begin{array}{r} \alpha+\beta+\gamma=0 \\ \beta h_{d}\gamma h_{s}=1 \\ \beta h_{d}^{2}+\gamma h_{s}^{2}=0 \end{array} \right.\end{split}\]The resulting approximation of \(f_{i}^{(1)}\) is the following:
\[\hat f_{i}^{(1)} = \frac{ h_{s}^{2}f\left(x_{i} + h_{d}\right) + \left(h_{d}^{2}  h_{s}^{2}\right)f\left(x_{i}\right)  h_{d}^{2}f\left(x_{i}h_{s}\right)} { h_{s}h_{d}\left(h_{d} + h_{s}\right)} + \mathcal{O}\left(\frac{h_{d}h_{s}^{2} + h_{s}h_{d}^{2}}{h_{d} + h_{s}}\right)\]It is worth noting that if \(h_{s}=h_{d}\) (i.e., data are evenly spaced) we find the standard second order approximation:
\[\hat f_{i}^{(1)}= \frac{f\left(x_{i+1}\right)  f\left(x_{i1}\right)}{2h} + \mathcal{O}\left(h^{2}\right)\]With a similar procedure the forward/backward approximations used for boundaries can be derived.
References
[1] Quarteroni A., Sacco R., Saleri F. (2007) Numerical Mathematics (Texts in Applied Mathematics). New York: Springer. [2] Durran D. R. (1999) Numerical Methods for Wave Equations in Geophysical Fluid Dynamics. New York: Springer. [3] Fornberg B. (1988) Generation of Finite Difference Formulas on Arbitrarily Spaced Grids, Mathematics of Computation 51, no. 184 : 699706. PDF. Examples
>>> f = np.array([1, 2, 4, 7, 11, 16], dtype=float) # doctest: +SKIP >>> np.gradient(f) # doctest: +SKIP array([ 1. , 1.5, 2.5, 3.5, 4.5, 5. ]) >>> np.gradient(f, 2) # doctest: +SKIP array([ 0.5 , 0.75, 1.25, 1.75, 2.25, 2.5 ])
Spacing can be also specified with an array that represents the coordinates of the values F along the dimensions. For instance a uniform spacing:
>>> x = np.arange(f.size) # doctest: +SKIP >>> np.gradient(f, x) # doctest: +SKIP array([ 1. , 1.5, 2.5, 3.5, 4.5, 5. ])
Or a non uniform one:
>>> x = np.array([0., 1., 1.5, 3.5, 4., 6.], dtype=float) # doctest: +SKIP >>> np.gradient(f, x) # doctest: +SKIP array([ 1. , 3. , 3.5, 6.7, 6.9, 2.5])
For two dimensional arrays, the return will be two arrays ordered by axis. In this example the first array stands for the gradient in rows and the second one in columns direction:
>>> np.gradient(np.array([[1, 2, 6], [3, 4, 5]], dtype=float)) # doctest: +SKIP [array([[ 2., 2., 1.], [ 2., 2., 1.]]), array([[ 1. , 2.5, 4. ], [ 1. , 1. , 1. ]])]
In this example the spacing is also specified: uniform for axis=0 and non uniform for axis=1
>>> dx = 2. # doctest: +SKIP >>> y = [1., 1.5, 3.5] # doctest: +SKIP >>> np.gradient(np.array([[1, 2, 6], [3, 4, 5]], dtype=float), dx, y) # doctest: +SKIP [array([[ 1. , 1. , 0.5], [ 1. , 1. , 0.5]]), array([[ 2. , 2. , 2. ], [ 2. , 1.7, 0.5]])]
It is possible to specify how boundaries are treated using edge_order
>>> x = np.array([0, 1, 2, 3, 4]) # doctest: +SKIP >>> f = x