rastercube documentation

rastercube is a python package to store large geographical raster collections on the Hadoop File System (HDFS). See the Getting started section for more.

Getting started

rastercube is a python package to store large geographical raster collections on the Hadoop File System (HDFS). The initial use case was to store and quickly access MODIS NDVI timeseries for any pixel on the whole planet. In addition, rastercube provides facilities for processing the data using Spark.

It allows you to turn a set of raster images (e.g. MODIS HDF files) into a 3D cube that enables efficient querying of pixel time series.

_images/ndvi.png
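As a toy illustration (plain NumPy, not rastercube's actual storage format), retrieving one pixel's time series from a (y, x, time) cube is a single slice instead of a read across every input raster:

```python
import numpy as np

# Tiny stand-in cube with layout (y, x, time); rastercube's real cubes are
# chunked and stored on HDFS, but the access pattern is the same idea.
cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Full time series for the pixel at (y=1, x=2) in one contiguous read
series = cube[1, 2, :]
print(series)  # [20 21 22 23]
```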

It was initially written to perform pantropical land-cover change detection in the terra-i project, based on MODIS NDVI imagery.

The central component is the rastercube.jgrid.jgrid3.Header class.

All the geographical computations are handled by gdal.

Environment variables

Example:

export RASTERCUBE_TEST_DATA=$HOME/rastercube_testdata
export RASTERCUBE_DATA=$TERRAI_DATA
export RASTERCUBE_WORLDGRID=$TERRAI_WORLDGRID
export RASTERCUBE_CONFIG=$TERRAI_DATA/1_manual/rastercube_config.py

Configuration

Rastercube’s configuration options are defined in the rastercube.config module. To override options, create a rastercube_config.py file on your file system and have the $RASTERCUBE_CONFIG environment variable point to it. For example:

export RASTERCUBE_CONFIG=$RASTERCUBE_DATA/1_manual/rastercube_config.py

With rastercube_config.py containing:

MODIS_HTTP_USER="foo"
MODIS_HTTP_PASS="bar"
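The override mechanism can be pictured as executing the user's file as a Python module and keeping the uppercase names it defines. This is a hedged sketch of that idea, not necessarily how rastercube.config implements it:

```python
import importlib.util
import os
import tempfile

# Hedged sketch: load a rastercube_config.py-style file and collect the
# uppercase names it defines (the real rastercube.config may differ).
def load_user_config(path):
    spec = importlib.util.spec_from_file_location('user_config', path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return {k: v for k, v in vars(mod).items() if k.isupper()}

# Demonstrate with a temporary config file
with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write('MODIS_HTTP_USER = "foo"\nMODIS_HTTP_PASS = "bar"\n')
    path = f.name
overrides = load_user_config(path)
os.unlink(path)
print(overrides['MODIS_HTTP_USER'])  # foo
```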

Worldgrid

A worldgrid is a (chunked) georeferenced nD (n >= 2) array covering the whole earth. The first two dimensions are spatial dimensions and the third dimension is usually time.

The original use case for which the format was designed was to store MODIS NDVI timeseries for the pantropical areas (South America, Africa and Asia). MODIS NDVI images have a frequency of 8 days (if you use both Terra and Aqua) and a ground resolution of 250x250m.

In the case of MODIS NDVI, there is also a quality indicator, so you would have two worldgrids: one containing NDVI and one containing quality. You could imagine more worldgrids if you were interested in other MODIS bands. You can also have worldgrids for other data types such as Global Land Cover data.

The rastercube package provides a simple API to store worldgrids on the Hadoop File System (HDFS).
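To picture why chunking helps, here is a hypothetical address computation: a pixel query only needs to touch the one chunk it falls in. The chunk size and the index scheme here are assumptions for illustration, not rastercube's actual layout:

```python
CHUNK_SIZE = 256  # spatial chunk edge in pixels (assumed value)

def chunk_for_pixel(x, y, chunk_size=CHUNK_SIZE):
    """Map global pixel coordinates to a chunk index plus an in-chunk offset."""
    cx, off_x = divmod(x, chunk_size)
    cy, off_y = divmod(y, chunk_size)
    return (cx, cy), (off_x, off_y)

# Reading a pixel's time series only requires loading chunk (3, 1)
print(chunk_for_pixel(1000, 300))  # ((3, 1), (232, 44))
```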

Date management

Geographical rasters are usually downloaded as timestamped tiles. When creating a worldgrid, you want to ensure that you have the same dates available for each tile so you have a homogeneous time axis for the whole grid.
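A minimal way to check that homogeneity (a sketch for illustration, not part of rastercube's API) is to compare each tile's date set against a reference tile:

```python
def tiles_with_missing_dates(dates_by_tile):
    """Return the tiles whose date set differs from the first tile's."""
    ref = set(next(iter(dates_by_tile.values())))
    return sorted(t for t, d in dates_by_tile.items() if set(d) != ref)

# Hypothetical per-tile date listings
dates = {
    'h09v07': ['2016_01_01', '2016_01_09'],
    'h10v08': ['2016_01_01'],  # missing a date
}
print(tiles_with_missing_dates(dates))  # ['h10v08']
```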

This is done by collecting the available dates for a particular reference tile and storing them in a .csv. See the ndvi_collect_dates.py script.
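The gist of collecting dates can be sketched as follows, using the standard MODIS filename convention (the AYYYYDDD field encodes year plus day-of-year); the actual ndvi_collect_dates.py may differ in detail:

```python
import re
from datetime import datetime, timedelta

def date_from_hdf_name(fname):
    """Extract the acquisition date from a MODIS HDF filename."""
    m = re.search(r'\.A(\d{4})(\d{3})\.', fname)
    year, doy = int(m.group(1)), int(m.group(2))
    return datetime(year, 1, 1) + timedelta(days=doy - 1)

d = date_from_hdf_name('MOD13Q1.A2016009.h09v07.006.2016029065706.hdf')
print(d.date())  # 2016-01-09
```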

Note

If a particular MODIS tile is missing a date (this can happen due to satellite malfunction), you can just manually create a tile with all bad QA/NDVI.
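Such a stand-in tile could be built like this. The 4800x4800 shape is the standard 250 m MODIS tile size and -3000 is the MOD13 NDVI fill value; treating QA value 0 as lowest quality is an assumption here:

```python
import numpy as np

# Stand-in tile for a missing date: everything flagged as bad so the time
# axis stays homogeneous across the worldgrid.
ndvi = np.full((4800, 4800), -3000, dtype=np.int16)   # MOD13 NDVI fill value
qa = np.zeros((4800, 4800), dtype=np.uint16)          # assumed: 0 == worst quality
```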

Input data files structure

The input data (e.g. the MODIS HDF tiles, the GLCF tiles) that are imported into a worldgrid need to be stored on a traditional Unix filesystem (or on NFS).

The environment variable $RASTERCUBE_DATA should point to the location where the data is stored.

For example, here is the layout used in the terra-i project. rastercube assumes that you have similar 0_input and 1_manual directories.

  • 0_input should contain all input raster data
  • 1_manual should contain some manually created files like the NDVI dates .csv
/home/julien/terra_i/sv2454_data
├── 0_input
│   ├── glcf_5.1
│   ├── hansen_dets
│   ├── landsat8
│   ├── MODIS_HDF
│   ├── modis_www_mirror
│   ├── prm
│   ├── terrai_dets
│   └── TRMM_daily
├── 1_manual
│   ├── hansen_wms
│   ├── ndvi_dates.csv
│   ├── ndvi_dates.short.csv
│   ├── ndvi_dates.terra_aqua.csv
│   ├── ndvi_dates.terra_aqua.short.csv
│   ├── ndvi_dates.terra.csv
│   └── qgis
├── 2_intermediate
│   ├── logs
│   ├── models
│   ├── models_rawndvi
│   └── terrai_dets
├── experimental
│   ├── glcf_dets
│   ├── glcf_dets_2
│   ├── land_cover
│   ├── qgis
│   ├── test2.hdf
│   └── test.hdf

Building the Cython modules

rastercube relies on Cython modules for some operations. These Cython modules need to be compiled; to do so, you can simply run the build.sh script in the root directory. There are a few gotchas though:

Building with Anaconda on OSX

If you are using OSX and use Anaconda’s python, you need to make sure that cython will build with Anaconda’s GCC and NOT clang or the system GCC. This is because Anaconda’s python is linked with libstdc++ (the GCC standard library) and OSX ships with a pre-C++11 version of libstdc++, which might lead to problems.

In short, ensure that if you run which gcc, it either points to Anaconda’s gcc or to a recent homebrew installed GCC.

You can control the C/C++ compiler used by setting the CC and CXX environment variables when building:

CC=/usr/local/bin/gcc-7 CXX=/usr/local/bin/g++-7 ./build.sh

Building to use through Spark on a cluster

When you use rastercube with a Spark cluster, Spark will distribute the python egg (the output of build.sh) to all Spark worker nodes. Since the egg contains native modules, you need to ensure that it is built for the correct CPU architecture. Assuming you have a homogeneous cluster, the easiest way to do so is to build rastercube and start the Spark job from one of the cluster nodes.

Otherwise, if you build from a Linux machine, you need to make sure that GCC targets an architecture that is common to both your Linux machine and the cluster. You can play with the -march option in setup.py to target the correct CPU type, going all the way down to -march=i386, which should always work but could lead to lower performance.
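For illustration, the flag would be passed through an extension's compile arguments. The module name and source file below are hypothetical, and where rastercube's setup.py actually wires in -march may differ:

```python
from setuptools import Extension

# Hypothetical extension definition, not rastercube's real setup.py.
# -march=x86-64 is a conservative, widely compatible 64-bit baseline;
# swap in a more specific value matching your cluster's CPUs.
ext = Extension(
    'rastercube.native_example',            # hypothetical module name
    sources=['rastercube/native_example.pyx'],
    extra_compile_args=['-O3', '-march=x86-64'],
)
print(ext.extra_compile_args)
```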

Examples

Warning

Ensure you have read the Configuration and Environment variables sections.

Importing a new tile into HDFS

To import a new tile, you first want to download the HDF files for said tile.

Edit your configuration file

The first step is to modify your Configuration file to add the tile name to MODIS_TERRA_TILES.

For example:

MODIS_TERRA_TILES = [
    ...
    # central america
    'h09v07',
    ...
]

Run ndvi_hdf_download.py

Now, run ndvi_hdf_download.py, which will update the local cache of the MODIS tile listing and download the tiles specified in the config.

python rastercube/scripts/ndvi_hdf_download.py

Run create_ndvi_worldgrid.py

python rastercube/scripts/create_ndvi_worldgrid.py \
    --tile=h09v07 \
    --worldgrid=hdfs:///user/terrai/worldgrid/ \
    --dates_csv=$TERRAI_DATA/1_manual/ndvi_dates.terra_aqua.csv

Updating the worldgrid when new dates are available

First, update your dates CSV:

python rastercube/scripts/ndvi_collect_dates.py \
  --tile=h09v07 \
  --outfile=$TERRAI_DATA/1_manual/ndvi_dates.terra_aqua.csv

And then, for each tile, run:

python rastercube/scripts/complete_ndvi_worldgrid.py \
    --tile=h09v07 \
    --worldgrid=hdfs:///user/terrai/worldgrid/ \
    --dates_csv=$TERRAI_DATA/1_manual/ndvi_dates.terra_aqua.csv

Example notebooks

Loading two different jgrid worldgrids and reprojecting one on another

This notebook shows how you can use jgrid_utils to load jgrids with different projections/geotransforms into the same reference frame.

In [1]:
import pylab as pl
import numpy as np
import matplotlib.cm as cm
import terrapy
import rastercube.jgrid as jgrid
import rastercube.jgrid.utils as jgrid_utils
import rastercube.regions as regions
import rastercube.datasources.glcf as terra_glcf

%matplotlib inline
np.set_printoptions(precision=5, suppress=True)
In [2]:
glcf_header = jgrid.load('hdfs:///user/terrai/worldgrid/glcf/2004')
ndvi_header = jgrid.load('hdfs:///user/terrai/worldgrid/ndvi')
poly = regions.polygon_for_region("test_zones_1.h12v11_1")
Load in NDVI projection
In [3]:
reload(jgrid_utils)
ndvi_xy_from, ndvi_data, ndvi_mask, glcf_data, glcf_mask = \
    jgrid_utils.load_poly_latlng_from_multi_jgrids([ndvi_header, glcf_header], poly)
arr dtype :  uint8  => gdal dtype :  Byte
arr dtype :  uint8  => gdal dtype :  Byte
In [4]:
pl.figure(figsize=(12, 10))
pl.subplot(121)
pl.imshow(glcf_mask, cmap=cm.binary)
pl.subplot(122)
pl.imshow(ndvi_mask, cmap=cm.binary)

pl.figure(figsize=(12, 10))
pl.subplot(121)
pl.imshow(terra_glcf.glcf_to_rgb(glcf_data))
pl.subplot(122)
pl.imshow(ndvi_data[:,:,150])
Out[4]:
<matplotlib.image.AxesImage at 0x7f3ba9eb1610>
_images/notebooks_load_ndvi_glcf_6_1.png
_images/notebooks_load_ndvi_glcf_6_2.png
Load in GLCF projection
In [5]:
reload(jgrid_utils)
glcf_xy_from, glcf_data, glcf_mask, ndvi_data, ndvi_mask = \
    jgrid_utils.load_poly_latlng_from_multi_jgrids([glcf_header, ndvi_header], poly)
arr dtype :  int16  => gdal dtype :  Int16
arr dtype :  uint8  => gdal dtype :  Byte
In [6]:
pl.figure(figsize=(12, 10))
pl.subplot(121)
pl.imshow(glcf_mask, cmap=cm.binary)
pl.subplot(122)
pl.imshow(ndvi_mask, cmap=cm.binary)

pl.figure(figsize=(12, 10))
pl.subplot(121)
pl.imshow(terra_glcf.glcf_to_rgb(glcf_data.squeeze()))
pl.subplot(122)
pl.imshow(ndvi_data[:,:,150])
Out[6]:
<matplotlib.image.AxesImage at 0x7f3ba9dbc710>
_images/notebooks_load_ndvi_glcf_9_1.png
_images/notebooks_load_ndvi_glcf_9_2.png
In [ ]:

Loading NDVI and QA data

In [1]:
import os
import pylab as pl
import numpy as np
import matplotlib.cm as cm
import terrapy
import rastercube.jgrid as jgrid
import rastercube.jgrid.utils as jgrid_utils
import rastercube.datasources.modis as modis
import rastercube.regions as regions

%matplotlib inline
np.set_printoptions(precision=5, suppress=True)
In [2]:
#WORLDGRID = os.path.join(os.environ['TERRAI_DATA'], 'sv2455/jgrids/worldgrid/')
WORLDGRID = 'hdfs:///user/terrai/worldgrid'
ndvi_header = jgrid.load(WORLDGRID + '/ndvi')
qa_header = jgrid.load(WORLDGRID + '/qa')
#poly = regions.polygon_for_region("test_zones_1.h12v11_1")
poly = regions.polygon_for_region("test_zones_1.h12v09_1")
In [3]:
reload(jgrid_utils)
ndvi_xy_from, ndvi_data, ndvi_mask, qa_data, qa_mask = \
    jgrid_utils.load_poly_latlng_from_multi_jgrids([ndvi_header, qa_header], poly, progressbar=True)
0%  100%
[######] | ETA: 00:00:00
Total time elapsed: 00:00:12
0%  100%
[######] | ETA: 00:00:00
Total time elapsed: 00:00:12
In [4]:
pl.imshow(ndvi_data[:,:,150])
Out[4]:
<matplotlib.image.AxesImage at 0x7f4f9c7bd3d0>
_images/notebooks_load_ndvi_qa_4_1.png
In [5]:
# Convert the uint16 qa to 0-1 float confidence
qaconf = modis.qa_to_qaconf(qa_data)
print qa_data.dtype
print qaconf.dtype
uint16
float32
In [6]:
pl.imshow(qaconf[:,:,100], vmin=0, vmax=1); pl.colorbar()
Out[6]:
<matplotlib.colorbar.Colorbar at 0x7f4f991c1890>
_images/notebooks_load_ndvi_qa_6_1.png
In [7]:
i, j = 350, 450

pl.figure(figsize=(15, 3))
pl.plot(ndvi_data[i, j])
pl.ylim((0, 10000))
pl.ylabel('ndvi')
ax = pl.twinx()
ax.plot(qaconf[i, j], '.-', c='r')
ax.set_ylim((0, 1))
ax.set_ylabel('qaconf')

Out[7]:
<matplotlib.text.Text at 0x7f4f990e4550>
_images/notebooks_load_ndvi_qa_7_1.png

Scripts

Downloading HDF files

rastercube/scripts/ndvi_hdf_download.py

Stats about available HDF files

rastercube/scripts/ndvi_hdf_stats.py

Creating a new NDVI worldgrid

rastercube/scripts/create_ndvi_worldgrid.py

Creating a new GLCF worldgrid

rastercube/scripts/create_glcf_worldgrid.py

Collecting the available NDVI dates to a .csv

rastercube/scripts/ndvi_collect_dates.py

Completing a NDVI worldgrid when the data gets updated

rastercube/scripts/complete_ndvi_worldgrid.py

Exporting the fractions of a grid to a shapefile

rastercube/scripts/worldgrid_fracs_to_shapefile.py

_images/qgis_ndvi_fracs.png

NDVI fractions shapefile viewed in QGIS

Printing information about a worldgrid

rastercube/scripts/worldgrid_info.py

API

jgrid3

gdal_utils

imutils

regions