laforge: a low-key build system for data work.
Overview
laforge is a low-key build system designed to let Python, SQL, and Stata data work interoperate. It was originally developed internally for IRIS, the Institute for Research on Innovation and Science at the University of Michigan’s Institute for Social Research.
Features:
Interoperable: Read, write, and execute Python, SQL, and Stata scripts/data.
Straightforward: Simple build INI files designed for a one-click build.
No lock-in: Maintain scripts independent from laforge.
An Example Build
Directory Organization
Because laforge needs to find the scripts, and because scripts will likely interact with output from other scripts, I tend to keep related project/sub-project files together:
project
├───input
│ ├───a.csv
│ ├───b.xlsx
│ └───c.csv
├───output
│ ├───results_d.csv
│ ├───results_e.csv
│ └───results_f.csv
├───.env
├───build.ini
├───g.py
├───h.sql
├───i.py
└───j.py
.env
distro = mssql
server = MSSQL
database = testdb
schema = laforge
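The settings above are plain `key = value` pairs. As a minimal sketch of how such a file could be read with the standard library (not necessarily how laforge itself loads it; `read_settings` is a hypothetical helper), configparser can parse the text once a section header is supplied:

```python
import configparser


def read_settings(text):
    """Parse section-less `key = value` lines into a dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    # configparser requires a section header, so wrap the text in a dummy one.
    parser.read_string("[env]\n" + text)
    return dict(parser["env"])


example = """\
distro = mssql
server = MSSQL
database = testdb
schema = laforge
"""

settings = read_settings(example)
print(settings["distro"], settings["schema"])
```

Note that configparser lowercases keys by default but leaves values such as `MSSQL` untouched.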
build.ini
[DEFAULT]
Module Documentation
builder
Builder reads and executes tasks and lists of tasks.
laforge.builder.run_build(script_path, *, log, debug=False, dry_run=False)
    laforge’s core build command
class laforge.builder.FileCall(method, kwargs)
    property kwargs
        Alias for field number 1
    property method
        Alias for field number 0
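The "Alias for field number N" descriptions are what Python generates for `collections.namedtuple` fields, which suggests that FileCall is a two-field named tuple. A stand-in sketch, defined locally here rather than imported from laforge, with made-up `method` and `kwargs` values for illustration:

```python
from collections import namedtuple

# Local stand-in with the same field order as the documented signature.
FileCall = namedtuple("FileCall", ["method", "kwargs"])

# The values below are invented for the example.
call = FileCall(method="read_csv", kwargs={"sep": ","})

# Fields are reachable by name or by position.
assert call.method == call[0]
assert call.kwargs == call[1]

method, kwargs = call  # unpacks like an ordinary tuple
print(method, kwargs)
```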
class laforge.builder.TaskList(from_string, location='.')
    Todo: Implement cache_results=False
class laforge.builder.BaseTask(*, identifier, verb, target, content, config)
    Create a task to (verb) (something)
    Todo:
        if ":" in self.content:
            previous_result_key, actual_path_content = self.content.split(":")
    property path
        For handlers where dir[verb] + content = path
class
laforge.builder.
InternalPythonExecutor
(*, identifier, verb, target, content, config)[source]¶ Execute (without importing) Python script by path
This will set
__main__ = 'laforge'
..todo
Allow implict/explicit return of results.
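Running a script by path without importing it can be done with the standard library's runpy module. The sketch below illustrates that general technique under assumptions; it is not a copy of laforge's executor. It writes a temporary stand-in script and passes 'laforge' as `run_name`, so the script sees that value as `__name__` and an `if __name__ == "__main__":` guard would not fire.

```python
import runpy
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a project script such as g.py.
    script = Path(tmp) / "g.py"
    script.write_text("seen_name = __name__\nanswer = 6 * 7\n")

    # run_path executes the file and returns the resulting globals as a dict.
    result = runpy.run_path(str(script), run_name="laforge")

print(result["seen_name"], result["answer"])
```

Because the globals come back as a dictionary, a caller could pick results out of it by name, which is one possible route to the "return of results" mentioned in the Todo.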
command
Command-line interface for laforge.
sql
SQL utilities for mid-level interaction. Inspired by pathlib; powered by SQLAlchemy.
Note
Supported: MSSQL, MariaDB/MySQL, PostgreSQL, SQLite. Supportable: Firebird, Oracle, Sybase.
class laforge.sql.Channel(distro, *, server=None, database=None, schema=None, **engine_kwargs)
    Abstraction from Engine, other static details.
laforge.sql.execute(statement, fetch=False, channel=None)
    Convenience method, autofetches Channel if possible
class laforge.sql.Script(query, channel=None)
    SQL query string, parsable by ‘go’ separation and execute()able.
    to_table()
        Executes all and tries to return a DataFrame for the result of the final query.
        This is one of two ways that laforge retrieves tables.
        Warning: This is limited by the capacity of Pandas to retrieve only the final result. For Microsoft SQL Server, if a lengthy set of queries is desired, the most reliable approach appears to be a single final query after a ‘go’ as a batch terminator.
        Warning: This will rename columns that do not conform to naming standards.
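To make the ‘go’ separation and the "single final query" advice concrete, here is a rough sketch. It assumes a batch ends at a line containing only `go` (the exact rules laforge applies may differ), uses an in-memory SQLite database as a stand-in for the server, and returns plain rows rather than a DataFrame. `split_batches` is a hypothetical helper, not part of laforge.

```python
import re
import sqlite3

# Assumption: a batch ends at a line holding only "go" (any case).
GO_LINE = re.compile(r"^[ \t]*go[ \t]*;?[ \t]*$", flags=re.IGNORECASE | re.MULTILINE)


def split_batches(query):
    """Split a SQL script into non-empty batches on 'go' lines (hypothetical helper)."""
    return [part.strip() for part in GO_LINE.split(query) if part.strip()]


sql = """\
create table t (x int);
insert into t values (1);
go
select x from t;
"""

batches = split_batches(sql)

conn = sqlite3.connect(":memory:")
*setup, final = batches
for batch in setup:
    # Earlier batches are run only for their side effects.
    conn.executescript(batch)
# Only the final batch's result set is fetched.
rows = conn.execute(final).fetchall()
conn.close()

print(len(batches), rows)
```

Keeping the result-producing query alone in the last batch means there is exactly one result set to fetch, which matches the approach the warning above recommends.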
class laforge.sql.Table(name, channel=None, **kwargs)
    Represents a SQL table, featuring methods to read/write DataFrames.
    Todo: Factor out to superclass to allow views
tech
toolbox
Handful of utility functions
Note
These intentionally only depend on builtins.
Note
Some copyright information within this file is identified per-block below.
laforge.toolbox.round_up(n, nearest=1)
    Round up n to the nearest ‘nearest’.
    Parameters:
        n –
        nearest – (Default value = 1)
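As an illustration of what rounding up to the nearest ‘nearest’ could mean, the following is a small reimplementation that reads it as "the next multiple of nearest". It is a sketch only; laforge's own function body may differ, for example in how it handles floats or negative numbers.

```python
import math


def round_up(n, nearest=1):
    """Round n up to the next multiple of nearest (illustrative sketch)."""
    return nearest * math.ceil(n / nearest)


print(round_up(7.2))             # next whole number
print(round_up(41, nearest=5))   # next multiple of 5
print(round_up(40, nearest=5))   # already a multiple, so unchanged
```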
laforge.toolbox.prepare_to_access(path)
    Make directory exist and verify that file would be writable
laforge.toolbox.verify_file_is_writable(path, retry_attempts=3, retry_seconds=5)
    Check for locked file (e.g. Excel has CSV open)
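The two helpers above describe a common pattern: create the parent directory, then retry opening the file in case another program has it locked. A hedged sketch of that pattern follows; `wait_until_writable` is a hypothetical name, and the real functions may raise or log rather than return a boolean.

```python
import os
import tempfile
import time


def wait_until_writable(path, retry_attempts=3, retry_seconds=5):
    """Return True once path can be opened for appending, else False (hypothetical helper)."""
    for attempt in range(retry_attempts):
        try:
            # Append mode checks write access without truncating existing
            # content. Side effect: it creates the file if it is missing.
            with open(path, "a"):
                return True
        except OSError:
            # For example, PermissionError while another program holds the file.
            if attempt < retry_attempts - 1:
                time.sleep(retry_seconds)
    return False


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output", "results_d.csv")
    # Make the parent directory exist before checking the file itself.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    ok = wait_until_writable(target, retry_attempts=2, retry_seconds=0)

print(ok)
```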
laforge.toolbox.flatten(foo)
    Take any set of nests in an iterator and reduce it into one generator.
    ‘Nests’ include any iterable except strings.
    Parameters:
        foo –
    Note: flatten() was authored by Amber Yust at https://stackoverflow.com/a/5286571. This function is not claimed under the laforge license.
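To show the behavior described, here is an independent sketch of a recursive flattening generator. It is neither the Stack Overflow answer verbatim nor laforge's source, and treating `bytes` like strings is an assumption made for this example.

```python
from collections.abc import Iterable


def flatten(nested):
    """Yield leaf items from arbitrarily nested iterables, leaving strings whole."""
    for item in nested:
        # Strings are iterable, but are treated as leaves rather than nests.
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten(item)
        else:
            yield item


print(list(flatten([1, [2, (3, "four")], {5}])))
```

Because the function is a generator, nothing is computed until the result is iterated, so wrapping it in `list()` is needed to see all items at once.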
Contributing to Development
laforge supports Python 3.6+.
Process | Tool | Documentation
---|---|---
Automation | Nox |
Test | pytest |
Test coverage | pytest-cov |
Format | Black |
Lint | Flake8 |
Lint more | Pylint |
Document | Sphinx |
Suggested Environment
# Create virtual environment
python -m venv .venv
# Activate virtual environment with shell-specific script:
. .venv/bin/activate.fish # fish
# source ./.venv/bin/activate # bash
# source ./.venv/bin/activate.csh # csh
# Note that Python for Windows creates ./Scripts/ rather than ./bin/
# .\.venv\Scripts\Activate.ps1 # PowerShell
# .venv\Scripts\Activate.bat # cmd
# Install packages
python -m pip install -r requirements.txt
# Optional [packages] to include Excel and/or non-SQLite databases
python -m pip install -e .[mysql]
# Run tests
python -m pytest
# Run the gauntlet
python -m nox
Embedded TODOs
Todo: Implement cache_results=False
Todo: Restore quiet?
Todo:
    if ":" in self.content:
        previous_result_key, actual_path_content = self.content.split(":")
Todo: De-messify
Todo: Factor out to superclass to allow views
Todo: class InvalidIdentifierError relay_id_problem(identifier, action, reason=None, replacement=None)
Docstring Gaps
Undocumented Python objects
===========================
laforge.builder
---------------
Functions:
* find_build_config_in_directory
* get_package_logger
* get_verb
* is_verb
* run_cmd
* seconds_since
Classes:
* BaseTask -- missing methods:
- implement
- validate_results
* ExistenceChecker -- missing methods:
- implement
* FileReader -- missing methods:
- implement
* FileWriter -- missing methods:
- implement
- write
* InternalPythonExecutor -- missing methods:
- implement
* SQLExecutor -- missing methods:
- implement
* SQLQueryReader -- missing methods:
- implement
* SQLReaderWriter -- missing methods:
- implement
* ShellExecutor -- missing methods:
- implement
* StataExecutor -- missing methods:
- implement
* Task
* TaskList -- missing methods:
- load_tasks
- template_content
laforge.command
---------------
Functions:
* user_confirms_cleartext
laforge.distros
---------------
Classes:
* Distro
* MSSQL
* MySQL
* PostgresQL
* SQLite
laforge.sql
-----------
Functions:
* fix_bad_columns
Classes:
* Channel -- missing methods:
- clean_up_statement
- find
- grab
- retrieve_engine
- save_engine
* Identifier -- missing methods:
- check
* Script -- missing methods:
- read
* Table -- missing methods:
- exists
- resolve
laforge.tech
------------
Functions:
* capitalize_sentences
* nobabble
Classes:
* ModifiableVerb
* Technobabbler
laforge.toolbox
---------------
Functions:
* is_reserved_word
Free Software License
Copyright 2019 Matt VanEseltine.
laforge is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
laforge is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
Copies are attached with this documentation and available online at https://www.gnu.org/licenses/agpl.html.