Welcome to databuild’s documentation!

Databuild is an automation tool for data manipulation.

The general principles in Databuild are:

  • Low entry barrier
  • Easy to install
  • Easy to grasp
  • Extensible

Databuild can be useful for scenarios such as:

  • Documenting data transformations in your infoviz project
  • Automating data processing in a declarative way

Contents

Installation

Install Databuild:

$ pip install databuild

Quickstart

Run databuild using a buildfile:

$ data-build.py buildfile.json

buildfile.json contains a list of operations to be performed on data. Think of it as a script for a spreadsheet.

An example buildfile could be:

[
  {
    "operation": "sheets.import_data",
    "description": "Importing data from csv file",
    "params": {
      "sheet": "dataset1",
      "format": "csv",
      "filename": "dataset1.csv",
      "skip_last_lines": 1
    }
  },
  {
    "operation": "columns.add_column",
    "description": "Calculate the gender ratio",
    "params": {
      "sheet": "dataset1",
      "name": "Gender Ratio",
      "expression": {
        "language": "python",
        "content": "return float(row['Male Total']) / float(row['Female Total'])"
      }
    }
  },
  {
    "operation": "sheets.export_data",
    "description": "save the data",
    "params": {
      "sheet": "dataset1",
      "format": "csv",
      "filename": "dataset2.csv"
    }
  }
]

YAML buildfiles are also supported. Databuild will guess the format based on the file extension.

Philosophy

Databuild is an alternative to more complex and complete tools like pandas, numpy, and R.

It's aimed at users who are not necessarily data scientists and who are looking for a simpler alternative to such software.

It's admittedly less performant than those tools and is not optimized for huge datasets, but Databuild is much easier to get started with.

Buildfiles

A buildfile contains a list of operations to be performed on data. Think of it as a script for a spreadsheet.

JSON and YAML formats are supported. Databuild will guess the format based on the file extension.

An example buildfile could be:

[
  {
    "operation": "sheets.import_data",
    "description": "Importing data from csv file",
    "params": {
      "sheet": "dataset1",
      "format": "csv",
      "filename": "dataset1.csv",
      "skip_last_lines": 1
    }
  },
  {
    "operation": "columns.add_column",
    "description": "Calculate the gender ratio",
    "params": {
      "sheet": "dataset1",
      "name": "Gender Ratio",
      "expression": {
        "language": "python",
        "content": "return float(row['Male Total']) / float(row['Female Total'])"
      }
    }
  },
  {
    "operation": "sheets.export_data",
    "description": "save the data",
    "params": {
      "sheet": "dataset1",
      "format": "csv",
      "filename": "dataset2.csv"
    }
  }
]

The same buildfile in YAML:

- operation: sheets.import_data
  description: Importing data from csv file
  params:
    sheet: dataset1
    format: csv
    filename: dataset1.csv
    skip_last_lines: 1
- operation: columns.add_column
  description: Calculate the gender ratio
  params:
    sheet: dataset1
    name: Gender Ratio
    expression:
      language: python
      content: "return float(row['Male Total']) / float(row['Female Totale'])"
- operation: sheets.export_data
  description: save the data
  params:
    sheet: dataset1
    format: csv
    filename: dataset2.csv

Python API

Databuild can be integrated into your Python project. Just import the build function:

from databuild.builder import build

build('buildfile.json')

Supported arguments:

  • build_file: Required. Path to the buildfile.
  • settings: Optional. Python module path containing the settings. Defaults to databuild.settings.
  • echo: Optional. Set this to True if you want the operations' descriptions printed to the screen. Defaults to False.
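
For example (a sketch; mysettings is a hypothetical settings module on your PYTHONPATH):

from databuild.builder import build

build('buildfile.json', settings='mysettings', echo=True)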

Operation Functions

Operation functions are regular Python functions that perform actions on the book. Examples of operations are: sheets.import_data, columns.add_column, columns.update_column, and more.

They have a name that identifies them, an optional description, and a number of parameters that they accept. Different operation functions accept different parameters.
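
Schematically, every operation entry in a buildfile has the same shape (the values below are placeholders):

{
  "operation": "module.function_name",
  "description": "an optional human-readable note",
  "params": {}
}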

Available Operation Functions

sheets.import_data

Creates a new sheet importing data from an external source.

arguments:
  • filename: Required.
  • sheet: Optional. Defaults to filename's basename.
  • format: Values currently supported are 'csv' and 'json'.
  • headers: Optional. Defaults to null, meaning that the importer tries to autodetect the header names.
  • encoding: Optional. Defaults to 'utf-8'.
  • skip_first_lines: Optional. Defaults to 0. Supported only by the CSV importer.
  • skip_last_lines: Optional. Defaults to 0. Supported only by the CSV importer.
  • guess_types: Optional. If set to true, the CSV importer will try to guess the data types. Defaults to true.

sheets.copy

Creates a copy of the source sheet named destination. Optionally copies only the headers specified in headers.

arguments:
  • source
  • destination
  • headers (optional)

sheets.export_data

Exports the datasheet named sheet to the file named filename in the specified format. Optionally exports only the headers specified in headers.

arguments:
  • sheet
  • format
  • filename
  • headers (optional)

sheets.print_data

Prints the contents of the datasheet named sheet to the screen.

arguments:
  • sheet

columns.update_column

Updates the values of column in sheet, optionally restricted to the rows matching facets. Either values or expression is required.

arguments:
  • sheet
  • column
  • facets (optional)
  • values
  • expression

columns.add_column

Adds a column named name to sheet. If an expression is given, its result populates the new column for each row.

arguments:
  • sheet
  • name
  • expression (optional)

columns.remove_column

Removes the column named name from sheet.

arguments:
  • sheet
  • name

columns.rename_column

Renames the column old_name to new_name in sheet.

arguments:
  • sheet
  • old_name
  • new_name

columns.to_float

arguments:
  • sheet
  • column
  • facets (optional)

columns.to_integer

arguments:
  • sheet
  • column
  • facets (optional)

columns.to_decimal

arguments:
  • sheet
  • column
  • facets (optional)

columns.to_text

arguments:
  • sheet
  • column
  • facets (optional)

columns.to_datetime

arguments:
  • sheet
  • column
  • facets (optional)
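
For example, a buildfile entry converting a column to floats could look like this (the sheet and column names are illustrative):

{
  "operation": "columns.to_float",
  "description": "Make sure the ratio is a float",
  "params": {
    "sheet": "dataset1",
    "column": "Gender Ratio"
  }
}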

Custom Operation

You can add your own custom operations and use them in your buildfile.

An operation is just a regular Python function. The first argument has to be the workbook; the remaining arguments are pulled in from the params property of the operation in the buildfile.

def myoperation(workbook, foo, bar, baz):
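    # `workbook` is always passed first; `foo`, `bar`, and `baz`
    # are filled in from the "params" object in the buildfile.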
    pass

Operations are defined in modules, which are just regular Python files.

As long as your operation modules are in your PYTHONPATH, you can add them to your OPERATION_MODULES setting (see the OPERATION_MODULES setting below) and then call the operation in your buildfile by referencing its import path:

[
    ...,
    {
        "operation": "mymodule.myoperation",
        "description": "",
        "params": {
            "foo": "foos",
            "bar": "bars",
            "baz": "bazes"
        }
    }
]
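
The corresponding settings change could look like this (assuming the module is importable as mymodule):

OPERATION_MODULES = (
    "databuild.operations.sheets",
    "databuild.operations.columns",
    "mymodule",
)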

Expressions

Expressions are objects encapsulating code for situations such as filtering or calculations.

An expression has the following properties:

  • language: The name of the environment where the expression will be executed, as specified in settings.LANGUAGES (see the LANGUAGES setting).
  • content: The actual code to run, or
  • path: Path to a file containing the code to run.

The expression will be evaluated inside a function and run against every row in the datasheet. The following context variables will be available:

  • row: A dictionary representing the currently selected row.
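
An expression can also load its code from a file instead of inlining it (the path below is illustrative):

{
  "language": "python",
  "path": "expressions/gender_ratio.py"
}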

Environments

Expressions are evaluated in the environment specified by their language property.

The value maps to a specific environment as specified in settings.LANGUAGES (see the LANGUAGES setting).

Included Environments

Currently, the following environments are shipped with databuild:

Python

Unsafe Python environment. Use only with trusted build files.

Writing Custom Environments

An Environment is a subclass of databuild.environments.base.BaseEnvironment that implements the following methods:

  • __init__(self, book): Initializes the environment with the appropriate global variables.
  • copy(self, iterable): Copies a variable from the databuild process to the hosted environment.
  • eval(self, expression): Evaluates the expression string into an actual function and returns it.
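
A minimal skeleton might look like this (a sketch only; it assumes BaseEnvironment imposes no extra requirements, and the eval body mirrors how expression code is wrapped in a function so that return statements work):

from textwrap import indent

from databuild.environments.base import BaseEnvironment


class MyEnvironment(BaseEnvironment):
    def __init__(self, book):
        # Keep a reference to the book and set up any globals
        # the hosted language needs.
        self.book = book

    def copy(self, iterable):
        # Move a value from the databuild process into the hosted
        # environment; plain Python needs only a shallow copy.
        return list(iterable)

    def eval(self, expression):
        # Wrap the expression source in a function body so that
        # `return` works, then hand back the resulting callable.
        namespace = {}
        exec("def _expr(row):\n" + indent(expression, "    "), namespace)
        return namespace["_expr"]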

Add-on Environments

Lua

An additional Lua environment is available at http://github.com/databuild/databuild-lua

Requires Lua or LuaJIT (Note: LuaJIT is currently unsupported on OS X).

Functions

Functions are additional methods that can be used inside Expressions.

Available Functions

cross

Returns a single value from a column in a different sheet.

arguments:
  • row: reference to the current row
  • sheet_source: name of the sheet that you want to get the data from
  • column_source: name of the column that you want to get the data from
  • column_key: name of the column used to match rows between the sheets
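
Inside an expression, such a function is called by name. For instance (a sketch; the sheet and column names are made up):

{
  "language": "python",
  "content": "return cross(row, 'countries', 'Population', 'Country')"
}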

column

Returns an array of values from a column in a different sheet, ordered by the key column.

arguments:
  • sheet_name: name of the current sheet
  • sheet_source: name of the sheet that you want to get the data from
  • column_source: name of the column that you want to get the data from
  • column_key: name of the column used to match rows between the sheets

Custom Functions Modules

You can write your own custom functions modules.

A function module is a regular Python module containing Python functions with the following signature:

def myfunction(environment, book, **kwargs)

Functions must accept the environment and book positional arguments. Every argument after those is up to the function.

Another requirement is that the function must return its value wrapped in the environment's copy method:

return environment.copy(my_return_value)

Function modules must be made available by adding them to the FUNCTION_MODULES setting.
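
A complete function module might look like this (a hypothetical module named myfunctions.py; the repeat function and its arguments are illustrative):

# myfunctions.py
def repeat(environment, book, value=None, times=1):
    # `environment` and `book` are always passed first; the
    # remaining keyword arguments come from the expression call.
    return environment.copy([value] * times)

It would then be registered by extending the default FUNCTION_MODULES tuple:

FUNCTION_MODULES = (
    'databuild.functions.data',
    'myfunctions',
)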

Settings

ADAPTER

Import path of the adapter class. Defaults to 'databuild.adapters.locmem.models.LocMemBook'.

LANGUAGES

A dict mapping languages to Environments. Defaults to:

LANGUAGES = {
    'python': 'databuild.environments.python.PythonEnvironment',
}

FUNCTION_MODULES

A tuple of module paths to import Functions from. Defaults to:

FUNCTION_MODULES = (
    'databuild.functions.data',
)

OPERATION_MODULES

A tuple of module paths to import Operation Functions from. Defaults to:

OPERATION_MODULES = (
    "databuild.operations.sheets",
    "databuild.operations.columns",
)
