Spaceland: access shapefiles in Python

Spaceland is a modern Python library for fast, Pythonic access to ESRI shapefiles.

Or, at least, that’s what it will be. Spaceland is in active development, but it’s still early days and the library isn’t yet feature-complete. What it does support today is reading dBase III files (for why this is important, see What’s a shapefile?).

Spaceland is developed on GitHub and contributions are welcome.

What’s a shapefile?

Created by ESRI in the early 1990s, the shapefile is a historical data format that has outlived its technical usefulness and yet persists as the lingua franca of most geospatial tools.

Aim of the library

The aim of Spaceland is to provide the fastest and most idiomatic method of reading ESRI shapefiles in Python 3. To support that aim, the objectives are:

  • Read all shapes and records from shapefiles as fast as possible
  • Be written in idiomatic Python 3 and provide a modern Python 3 interface to shapefiles
  • Use built-in types as much as possible (you shouldn’t have to know about shapefile/dBase internals to use the data)
  • Provide a high-level interface to data in zipped shapefiles
  • Provide a low-level interface for those that need it
  • Let people convert shapefile data into more modern formats
  • Integrate with other Python geospatial libraries
  • Include as close to 100% test coverage as possible
  • Include high-quality documentation on the library and the shapefile format

For further details, see the roadmap.

What it won’t do

Spaceland is read-only. The shapefile should be considered a historical data format since it’s not well-suited to our Web-focussed world, and more suitable formats are now available (e.g. GeoJSON, TopoJSON, geospatial databases).

Spaceland won’t convert between coordinate systems, nor will it manipulate or analyse the data. But it should integrate with packages that do.

Roadmap

The current version is 0.1.0-dev. Here’s the roadmap for getting to version 1.0.0.

Version 0.1.0

  • Read two-dimensional points, poly lines and polygons from shapefiles
  • Publish package on PyPI

Version 0.2.0

  • Read from ZIP files directly
  • High-level interface for reading shapes and records
  • Installation and quick start documentation

Version 0.3.0

  • Support shape indexes (.shx files)
  • Include projection metadata from .prj files
  • Automatically select dBase III character-encoding from .cpg files

Version 0.4.0

  • Convert shapefiles to GeoJSON
  • Command-line interface for converting shapefiles to GeoJSON
  • Documentation on the command-line interface
  • Integrate with Shapely

Version 0.5.0

  • Support multi points
  • Support three-dimensional points, poly lines, polygons, and multi points

Version 0.6.0

  • Support measures on points, poly lines, polygons, and multi points

Version 1.0.0

  • Support surface patches (MultiPatch)

Installation

To do.

Quick start

To do.

Command-line interface

The dbfr command reads records from a dBase III file and writes them out as CSV:

usage: dbfr [-h] [--encoding ENCODING] [--delimiter DELIMITER] [--quote QUOTE]
            [--quote-always] [--escape ESCAPE] [--no-header] [--crlf]
            filename

Convert a dBase III file to CSV.

positional arguments:
  filename

optional arguments:
  -h, --help            show this help message and exit
  --encoding ENCODING, -e ENCODING
                        set encoding used to decode the DBF input
  --delimiter DELIMITER, -d DELIMITER
                        set field separator for CSV output
  --quote QUOTE, -q QUOTE
                        set quote character for CSV output
  --quote-always        quote all fields in output
  --escape ESCAPE       set character used to escape a quote character
  --no-header, -n       don't output column names in the first row
  --crlf                use '\r\n' line endings in the output

Reading shapefile attributes from a DBF file

To do.

The spaceland package

The spaceland package, named after the three-dimensional world in Edwin Abbott’s book Flatland: A Romance of Many Dimensions, contains everything required to read ESRI shapefiles. It’s broken down into several core modules:

The spaceland.shp module

Read non-topological geometric records from the ESRI Shapefile format.

ESRI documented the shapefile format in 1998 in a document titled ESRI Shapefile Technical Description.

class spaceland.shp.Shapefile(shp: typing.IO[bytes]) → None

Read records from an ESRI shapefile.

A shapefile is a binary format created by ESRI in the early 1990s for storing non-topological geometries. After a short header containing file metadata, the geometries are stored in a sequence of individual records. The format is compact and fast to read, but because it can’t contain indexes, details of the projection used, or metadata on individual shapes, it’s commonly accompanied by other files (e.g. a dBase III database for geometry metadata).

Class objects allow for iteration and can be used as context managers.

get_parse_function()

Return a function capable of parsing a particular type of shape.

The function returned will be suitable for parsing shapefile records of one type (e.g. two-dimensional points). The type is defined in the header of the shapefile, and so the returned function will handle all non-null records within a single shapefile.

Return type:Callable[[bytes], tuple]
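
As a rough illustration of the idea, choosing a parser by shape type can be sketched as a dictionary lookup. This is not Spaceland's implementation: the names get_parser_sketch and PARSERS are hypothetical, and the function here takes the shape type code as an argument, whereas the real method reads it from the file header. The codes 0 (null shape) and 1 (point) come from the ESRI specification.

```python
import struct
from typing import Callable, Dict


def _parse_null(content: bytes) -> tuple:
    # A null shape carries no geometry.
    return ()


def _parse_point(content: bytes) -> tuple:
    # Two little-endian doubles: x, then y.
    return struct.unpack("<2d", content)


# Hypothetical lookup table keyed by ESRI shape type code.
PARSERS: Dict[int, Callable[[bytes], tuple]] = {
    0: _parse_null,
    1: _parse_point,
}


def get_parser_sketch(shape_type: int) -> Callable[[bytes], tuple]:
    """Return the parser for one shape type, or fail for unknown types."""
    try:
        return PARSERS[shape_type]
    except KeyError:
        raise ValueError(f"unsupported shape type: {shape_type}") from None
```

Looking the parser up once per file, rather than once per record, keeps the per-record work down to a single function call.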
records()

Yield all geometric records in the shapefile, one-by-one.

Records are yielded in file order, each as a tuple whose structure depends on the shape type. The structure of each shape type’s tuple is detailed in the documentation of the shape parsing functions.

The appropriate parsing function for a file can be found using Shapefile.get_parse_function().

Return type:Iterable[tuple]
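
To make "file order" concrete, the following sketch walks the records of a .shp stream using only the layout given in the ESRI specification: a 100-byte file header, then records that each start with an 8-byte big-endian header (record number and content length in 16-bit words), followed by content that begins with a little-endian shape type. The function name iter_raw_records is hypothetical and the sketch yields raw bytes rather than parsed tuples; it may differ from how Spaceland does it.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple

FILE_HEADER_SIZE = 100  # bytes, fixed by the ESRI specification


def iter_raw_records(shp: BinaryIO) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (record_number, shape_type, shape_data) in file order."""
    shp.seek(FILE_HEADER_SIZE)
    while True:
        header = shp.read(8)
        if len(header) < 8:
            break  # end of file
        # Record headers are big-endian; the length counts 16-bit words.
        number, length_words = struct.unpack(">2i", header)
        content = shp.read(length_words * 2)
        # The content starts with the record's own shape type (little-endian).
        (shape_type,) = struct.unpack("<i", content[:4])
        yield number, shape_type, content[4:]
```

Because each record carries its own shape type, a reader can spot null records (type 0) in a file of any other type and handle them separately.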
class spaceland.shp.ShapefileMeta(shape_type, x_min, y_min, x_max, y_max, z_min, z_max, m_min, m_max)
m_max

Alias for field number 8

m_min

Alias for field number 7

shape_type

Alias for field number 0

x_max

Alias for field number 3

x_min

Alias for field number 1

y_max

Alias for field number 4

y_min

Alias for field number 2

z_max

Alias for field number 6

z_min

Alias for field number 5
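
The field order of ShapefileMeta matches the order in which the ESRI specification stores these values in the 100-byte file header: the shape type at byte 32, followed by eight little-endian doubles. A minimal sketch of reading them, assuming that layout and using the hypothetical names MetaSketch and read_meta_sketch, might look like this:

```python
import struct
from collections import namedtuple

# Same field order as ShapefileMeta; the name MetaSketch is hypothetical.
MetaSketch = namedtuple(
    "MetaSketch",
    "shape_type x_min y_min x_max y_max z_min z_max m_min m_max",
)


def read_meta_sketch(header: bytes) -> MetaSketch:
    """Extract the shape type and bounding box from a 100-byte header."""
    # Bytes 0-3 hold the big-endian file code, which is always 9994.
    (file_code,) = struct.unpack_from(">i", header, 0)
    if file_code != 9994:
        raise ValueError("not a shapefile header")
    # Bytes 32-99: little-endian int32 shape type, then eight doubles.
    return MetaSketch(*struct.unpack_from("<i8d", header, 32))
```

Since the result is a namedtuple, meta.x_max and meta[3] refer to the same value.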

spaceland.shp.parse_null_record(content)

Parse a null shape record from a shapefile.

A null shape is an empty record with no geometric data. It can be used as a shape type for a shapefile but it’s also valid as a placeholder in a shapefile of any other type. That is, a shapefile of polygons can also include null shape records. This is the only valid way a shapefile can contain multiple shape types.

Parameters:content (bytes) – An empty byte string
Return type:tuple
Returns:An empty tuple.
spaceland.shp.parse_point_record(content)

Parse a point shape record from a shapefile.

A point consists of a pair of double-precision coordinates ordered x, y.

Parameters:content (bytes) – 16 bytes containing two 64-bit IEEE double-precision floating-point numbers, in little-endian byte order.
Return type:tuple
Returns:A tuple containing a point in x, y order.
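
The byte layout described above maps directly onto Python's struct module. The sketch below is an illustration under that assumption, not the library's source; parse_point_sketch and POINT are hypothetical names, and the length check is an addition the documentation doesn't mention.

```python
import struct
from typing import Tuple

# "<2d" means two little-endian 64-bit IEEE doubles, 16 bytes in total.
POINT = struct.Struct("<2d")


def parse_point_sketch(content: bytes) -> Tuple[float, float]:
    """Decode 16 bytes into an (x, y) tuple."""
    if len(content) != POINT.size:
        raise ValueError(f"expected {POINT.size} bytes, got {len(content)}")
    return POINT.unpack(content)


# Round-trip an example coordinate pair through the binary form.
raw = POINT.pack(-0.1276, 51.5072)
point = parse_point_sketch(raw)
```

Compiling the format once with struct.Struct avoids re-parsing the format string for every record.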

The spaceland.dbf module

Reads the subset of the dBase III file format used by ESRI shapefiles.

The dBase III format was never specified publicly but it has been reverse-engineered. The best documentation on the subject can be found at http://www.clicketyclick.dk/databases/xbase/format/dbf.html.

class spaceland.dbf.DbaseFile(dbf: typing.IO[bytes], encoding: str = 'ascii') → None

Read fields and records from a dBase III binary file.

A dBase III file is a simple tabular data format consisting of a header, fields (columns), and records (rows). Fields are typed; as used in the ESRI shapefile format, each field in a dBase III file must have one of five types: string, float, integer, date, or boolean. All types allow null values.

Class objects allow for iteration and slicing, and they also work as context managers.
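
For readers curious about the header and field layout, here is a sketch based on the commonly documented (reverse-engineered) dBase III structure: a 32-byte header holding the record count, header length, and record length, followed by 32-byte field descriptors and a 0x0D terminator. The names FieldSketch and read_dbf_layout are hypothetical, and Spaceland's own parsing may differ.

```python
import struct
from typing import List, NamedTuple, Tuple


class FieldSketch(NamedTuple):
    name: str
    type: str  # single-character type code, e.g. "C" (character) or "D" (date)
    length: int  # field width in bytes


def read_dbf_layout(data: bytes) -> Tuple[int, int, int, List[FieldSketch]]:
    """Return (record_count, header_length, record_length, fields)."""
    # Bytes 4-7: record count; 8-9: header length; 10-11: record length.
    n_records, header_len, record_len = struct.unpack_from("<IHH", data, 4)
    fields = []
    offset = 32  # field descriptors start right after the 32-byte header
    while offset < header_len - 1 and data[offset] != 0x0D:
        # 11-byte null-padded name, 1-byte type, 4 skipped bytes, 1-byte length.
        name, ftype, length = struct.unpack_from("<11sc4xB", data, offset)
        fields.append(
            FieldSketch(
                name.split(b"\x00")[0].decode("ascii"),
                ftype.decode("ascii"),
                length,
            )
        )
        offset += 32
    return n_records, header_len, record_len, fields
```

Because every record has the same fixed length, the record at a given index sits at header_length + index * record_length, which is what makes indexing and slicing cheap.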

record(index)

Return the record at the given index.

Parameters:index (int) – The position of the record relative to the beginning of the file.
Return type:tuple
Returns:A namedtuple, each item matching one field in the record.
records(start=0)

Yield the records in the file.

A record is a set of fields and their values. The field names, types, and order are consistent across all records in the file.

It’s possible that a field has an invalid value (e.g. a non-numeric value in an integer field). When this happens the value becomes None and no error is raised.

Parameters:start (int) – The record from which to start iteration. By default starts with the first record in the file.
Yields:A namedtuple, each item matching one field in the record. Item names and order are consistent across records within the same file, but will differ between files.
Return type:Iterable[tuple]
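
The per-file namedtuple described above can be pictured with the standard library alone. In this sketch the field names and values are invented for illustration, and the use of rename=True to cope with names that aren't valid Python identifiers is an assumption rather than documented Spaceland behaviour.

```python
from collections import namedtuple


def make_record_type(field_names):
    """Build one namedtuple class for all records in a file."""
    # rename=True swaps invalid identifiers (such as keywords) for
    # positional names like _2 instead of raising an error.
    return namedtuple("Record", field_names, rename=True)


# Hypothetical field names, as they might be read from a file header.
Record = make_record_type(["NAME", "POP_2020", "class"])

# An unparseable value is represented as None rather than raising.
row = Record("Springfield", 30720, None)
```

Items can then be read by name (row.NAME) or by position (row[1]), and unpacked like any other tuple.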
spaceland.dbf.get_parse_str(encoding)

Return a function that decodes bytes to strings.

The returned function decodes the bytes using the character encoding passed to this function.

>>> utf8 = get_parse_str("UTF-8")
>>> utf8(b'\xf0\x9f\x91\x8d')
'👍'
Parameters:encoding (str) – The name of a character encoding that can be used to decode the bytes to a string.
Return type:Callable[[bytes], str]
Returns:A function that uses the given character encoding to convert bytes to strings.
spaceland.dbf.parse_bool(value)

Convert bytes to a boolean value.

Parameters:value (bytes) – A bytes value to be converted to a boolean value.
Return type:Optional[bool]
Returns:True if the bytes value is Y, y, T, or t; False if the bytes value is N, n, F, or f; None otherwise.
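
The mapping described above is small enough to sketch in full. The name parse_bool_sketch is hypothetical, and the code is an illustration of the documented behaviour rather than the library's source.

```python
from typing import Optional

TRUE_VALUES = (b"Y", b"y", b"T", b"t")
FALSE_VALUES = (b"N", b"n", b"F", b"f")


def parse_bool_sketch(value: bytes) -> Optional[bool]:
    """Map a one-byte dBase logical value to True, False, or None."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Anything else (for example b"?" or b" ") is treated as null.
    return None
```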
spaceland.dbf.parse_date(value)

Convert bytes in the format YYYYMMDD to a datetime.date object.

Parameters:value (bytes) – A bytes value to be converted to a date.
Return type:Optional[date]
Returns:A datetime.date object if the bytes value is a valid date, but None otherwise.
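
One way to get this behaviour, sketched here under the hypothetical name parse_date_sketch, is to slice the eight digits and let datetime.date reject impossible dates. The strict length and digit check is a choice made for this sketch; the library may validate differently.

```python
from datetime import date
from typing import Optional


def parse_date_sketch(value: bytes) -> Optional[date]:
    """Convert b"YYYYMMDD" to a date, or None if it isn't a valid date."""
    try:
        text = value.decode("ascii")
        if len(text) != 8 or not text.isdigit():
            return None
        # date() raises ValueError for impossible dates such as 30 February.
        return date(int(text[:4]), int(text[4:6]), int(text[6:]))
    except ValueError:
        # Also covers UnicodeDecodeError, which subclasses ValueError.
        return None
```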
spaceland.dbf.parse_float(value)

Convert bytes to a float.

Parameters:value (bytes) – A bytes value to be converted to a float.
Return type:Optional[float]
Returns:A float if the bytes value is a valid numeric value, but None otherwise.
spaceland.dbf.parse_int(value)

Convert bytes to an integer.

Parameters:value (bytes) – A bytes value to be converted to an integer.
Return type:Optional[int]
Returns:An integer if the bytes value is a valid numeric value, but None otherwise.
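
The numeric parsers share the same pattern: attempt the conversion and fall back to None. The sketch below uses the hypothetical names parse_int_sketch and parse_float_sketch and assumes values are ASCII text padded with spaces, as is usual in dBase files; the exact set of inputs the library accepts may differ.

```python
from typing import Optional


def parse_int_sketch(value: bytes) -> Optional[int]:
    """Convert space-padded ASCII digits to an int, or None."""
    try:
        return int(value.decode("ascii").strip())
    except ValueError:
        return None


def parse_float_sketch(value: bytes) -> Optional[float]:
    """Convert space-padded ASCII text to a float, or None."""
    try:
        # Note: Python's float() also accepts "nan" and "inf".
        return float(value.decode("ascii").strip())
    except ValueError:
        return None
```

Returning None instead of raising means one malformed value doesn't interrupt iteration over the rest of the file.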

The spaceland.cli module

Command-line interface to the library’s functionality.

This module provides the following functions that are registered as ‘console script’ entry points in setup.py:

  • dbf_to_csv(): convert dBase III files to CSVs (as command dbfr)

When the package is installed via setuptools (e.g. using pip install) the commands are immediately available to the user.

spaceland.cli.dbf_to_csv()

Read a dBase III file and convert it to a CSV.

Used as a ‘console script’ entry point in setup.py and available on the command-line as dbfr. The dBase III file named as an argument is parsed and converted to CSV, and output to stdout. The CSV dialect used can be configured using command-line options, as can the character-encoding used when reading the dBase file.

Return type:None
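
The command-line options line up naturally with the parameters of Python's csv.writer. The following sketch shows one plausible mapping; the function write_csv_sketch, its parameter names, and its defaults are assumptions for illustration and aren't taken from the Spaceland source.

```python
import csv
import io
import sys


def write_csv_sketch(header, rows, out=sys.stdout, delimiter=",", quote='"',
                     quote_always=False, escape=None, no_header=False,
                     crlf=False):
    """Write a header and rows as CSV using options like those of dbfr."""
    writer = csv.writer(
        out,
        delimiter=delimiter,                       # --delimiter
        quotechar=quote,                           # --quote
        quoting=csv.QUOTE_ALL if quote_always else csv.QUOTE_MINIMAL,
        escapechar=escape,                         # --escape
        doublequote=escape is None,                # double quotes unless escaping
        lineterminator="\r\n" if crlf else "\n",   # --crlf
    )
    if not no_header:                              # --no-header
        writer.writerow(header)
    writer.writerows(rows)


# Example: write two records to an in-memory buffer instead of stdout.
buffer = io.StringIO()
write_csv_sketch(["NAME", "POP"], [("A, B", 1), ("C", None)], out=buffer)
```

Note that csv.writer writes None as an empty field, which is a convenient representation for null values.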
spaceland.cli.extant_file(arg)

Type-check an argument to ensure it names an existing file.

Return type:Path
spaceland.cli.single_char(arg)

Type-check an argument to ensure it’s a string of length one.

Return type:str
spaceland.cli.valid_codec(arg)

Type-check an argument to ensure it names a known codec.

Return type:str
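
Functions like these are typically passed as the type argument to argparse's add_argument, raising ArgumentTypeError so that argparse prints a usage error. The sketch below shows that pattern with hypothetical _sketch names and invented error messages; it illustrates the approach and is not a copy of the library's code.

```python
import argparse
import codecs
from pathlib import Path


def extant_file_sketch(arg: str) -> Path:
    """Accept only paths that name an existing file."""
    path = Path(arg)
    if not path.is_file():
        raise argparse.ArgumentTypeError(f"{arg!r} is not an existing file")
    return path


def single_char_sketch(arg: str) -> str:
    """Accept only strings of length one."""
    if len(arg) != 1:
        raise argparse.ArgumentTypeError(f"{arg!r} must be a single character")
    return arg


def valid_codec_sketch(arg: str) -> str:
    """Accept only encoding names known to Python's codec registry."""
    try:
        codecs.lookup(arg)
    except LookupError:
        raise argparse.ArgumentTypeError(f"{arg!r} is not a known codec") from None
    return arg


# Wiring the checks into a parser, mirroring two of dbfr's options.
parser = argparse.ArgumentParser(prog="dbfr-sketch")
parser.add_argument("--delimiter", "-d", type=single_char_sketch, default=",")
parser.add_argument("--encoding", "-e", type=valid_codec_sketch, default="ascii")
args = parser.parse_args(["-d", ";", "-e", "utf-8"])
```

Validating at parse time means bad input is rejected with a usage message before any file is opened.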