Nomic

Nomic is a small, package-manager-like application that automates the deployment and installation of analytics applications into the Hadoop ecosystem. The analytics are packaged into an archive file called a box. A box has all of its files bundled inside, plus one descriptor file in its root named nomic.box that declares what is installed and where. The descriptor file uses a declarative DSL based on Groovy.

Getting Started

Before you start creating your own boxes and installing them into your Hadoop ecosystem, you have to get, install, and set up the Nomic application.

Important

First of all, check that your environment fulfills the most important requirement: Java 1.8

Once Java is present in your environment, you can get the application. Various distribution packages are available on Bintray. Pick the package that suits the kind of installation you choose; for more details about installation on various systems, see the Installation section. For experimenting and playing around with Nomic, I recommend the TAR (or ZIP) distribution.

Download the latest version of the application and unpack it into a working directory:

$ wget https://dl.bintray.com/sn3d/nomic-repo/nomic-{version}-bin.tar
$ tar -xzf ./nomic-{version}-bin.tar
$ cd ./nomic-{version}

Configuration

After unpacking, you will need to configure the Nomic application. First, you should copy the Hadoop core-site.xml and hdfs-site.xml into the ./conf folder. Nomic uses these files for connecting to HDFS. If you're not happy with copying these files, you can instead point to them via the configuration file conf/nomic.conf:

hdfs.core.site=/path/to/core-site.xml
hdfs.hdfs.site=/path/to/hdfs-site.xml

The default configuration uses the user's home folder: boxes are installed into the hdfs://${USER_HOME}/app folder and Nomic stores its metadata in hdfs://${USER_HOME}/.nomic.
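As a sketch, these default locations could be overridden in conf/nomic.conf; the paths below are purely illustrative, and the property names follow the reference configuration shown later in this document:

```
nomic.hdfs.app.dir = "/user/analytics/app"
nomic.hdfs.repository.dir = "/user/analytics/.nomic"
```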

You can test your configuration by executing:

$ ./bin/nomic config

After execution you should see how your Nomic instance is configured. If this command fails, there is probably something wrong in conf/nomic.conf, or your core-site.xml and hdfs-site.xml are invalid.

Your first box

Now it's time to create your first box. Let's imagine we've got a simple Oozie workflow.xml and we want to deploy it as an analytics application. We have to create a nomic.box next to the workflow file with this content:

group = "examples"
name = "simple-workflow"
version = "1.0.0"

hdfs {
  resource "workflow.xml"
}

For more information about how to write box descriptor files, see the Nomic DSL Guide section. Then we need to pack both files into an archive bundle. For that purpose we can use the Java jar utility:

$ jar cf simple-workflow.nomic ./workflow.xml ./nomic.box

Hurray! You've got your first box ready for deployment. Let's deploy it.

Deploying and removing

In the previous section we created our first Nomic box. We can deploy it easily by executing:

$ ./bin/nomic install simple-workflow.nomic

In your HDFS, you should now have workflow.xml available in the application folder ${USER_HOME}/app/examples/simple-workflow. After executing the ./bin/nomic list command, you will also see that the box was installed and which version is available. We should see this output:

$ ./bin/nomic list
examples:simple-workflow:1.0.0

One of the primary goals of Nomic is not only deploying but also safely removing deployed boxes. Let's remove our box:

$ ./bin/nomic remove examples:simple-workflow:1.0.0

The remove command erases only those resources that were deployed; it is the inverse of the install command.

Installation

Important

On first place check if your environment fulfills the most important requirement: Java 1.8

Standalone Installation (Windows)

For Windows systems, you can use the standalone distribution available as a ZIP archive. All you need to do is download the ZIP from Bintray, unpack it into a folder, and run bin\nomic.bat.

Standalone Installation (Linux/Mac OS)

For Linux and Mac OS systems, you can use the standalone distribution available as a TAR archive. All you need to do is download it, unpack it, and run it:

$ wget https://dl.bintray.com/sn3d/nomic-repo/nomic-{version}-bin.tar
$ tar -xzf ./nomic-{version}-bin.tar
$ cd ./nomic-{version}
$ ./bin/nomic

All configuration files, libraries and shell scripts are placed in one folder.

Installation from RPM (RedHat)

If you're using a Linux system with YUM or RPM, you can install the application as an RPM directly:

$ sudo rpm -i https://dl.bintray.com/sn3d/nomic-repo/nomic-{version}.noarch.rpm
$ nomic

Or you can add the YUM repository with the Nomic application:

$ wget https://bintray.com/sn3d/nomic-rhel/rpm -O bintray-sn3d-nomic-rhel.repo
$ sudo mv bintray-sn3d-nomic-rhel.repo /etc/yum.repos.d/

And then you can use YUM for installing/upgrading:

$ sudo yum install nomic
$ nomic

Now you can type nomic in the shell and you should see Nomic's output. The configuration is placed in /etc/nomic and the main application in the /usr/share/nomic folder.

Configuration

The Nomic configuration can be divided into two parts:

  • environment configuration
  • application configuration

Environment configuration

The environment configuration is about setting a few important environment variables:

  • $JAVA_HOME: path to the Java JRE that will be used
  • $NOMIC_HOME: points to the place where the Nomic application is installed
  • $NOMIC_CONF: path to the nomic.conf file. The default value is $NOMIC_HOME/conf/nomic.conf, but you can specify an alternative file.
  • $NOMIC_OPTS: Java runtime options used when the Nomic application is executed

All these environment variables can be set as global environment variables via the system, or you can override them in $NOMIC_HOME/conf/setenv.
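As a sketch, a $NOMIC_HOME/conf/setenv overriding two of these variables might look like this (the path and the memory option are illustrative assumptions, not shipped defaults):

```shell
# Illustrative conf/setenv -- adjust the paths and options for your system
export JAVA_HOME="/usr/lib/jvm/java-1.8.0"
export NOMIC_OPTS="-Xmx512m"
```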

Application configuration

The application configuration is the more detailed configuration that happens in the nomic.conf file. The configuration file follows the HOCON format, and the Nomic application will look for it in the $NOMIC_HOME/conf folder.

This example shows all of the default values:

################################################################################
# General Nomic configuration
################################################################################

# user that will be used by the Nomic application. Folders and files in HDFS
# will also be owned by this user.
#
# default value is the system user you're logged in with
nomic.user = ${user.name}

# Nomic home on FileSystem where nomic will look for configuration, write
# logs etc.
#
# default value is in user's home folder as '.nomic' folder
nomic.home = ${user.home}"/.nomic"

# This is home folder that will be used by Nomic in HDFS.
#
# default value is hdfs://server/user/{user.name}
nomic.hdfs.home = "/user/"${nomic.user}

# Points to the 'app' folder in HDFS where all boxes are deployed
#
# default value is hdfs://server/user/{user.name}/app
nomic.hdfs.app.dir = ${nomic.hdfs.home}"/app"

# Point to 'repository' folder in HDFS where Nomic will store metadata
#
# default value is hdfs://server/user/{user.name}/.nomic
nomic.hdfs.repository.dir = ${nomic.hdfs.home}"/.nomic"


################################################################################
# HDFS configuration
################################################################################

# What kind of adapter to HDFS will be used. The possible values are 'hdfs' and 'simulator'.
# The simulator just simulates HDFS on your local filesystem. It's useful for debugging etc.
hdfs.adapter = "hdfs"

# The directory where the simulator adapter will store all files. It makes sense
# to set this if you set the adapter to 'simulator'.
hdfs.simulator.basedir = ${nomic.home}"/hdfs"

# Point to Hadoop core configuration file. It's relevant for real 'hdfs' adapter.
hdfs.core.site = ${nomic.home}"/conf/core-site.xml"

# Point to Hadoop HDFS configuration file. It's relevant for real 'hdfs' adapter.
hdfs.hdfs.site = ${nomic.home}"/conf/hdfs-site.xml"



################################################################################
# HIVE configuration
################################################################################

# host for Hive JDBC
hive.host = "localhost:10000"

# this is JDBC connection string to HIVE
hive.jdbc.url = "jdbc:hive2://"${hive.host}

# The default HIVE schema
hive.schema = ${nomic.user}

# Username for HIVE connection
hive.user = ${nomic.user}

# Password for HIVE connection
hive.password = ""



################################################################################
# OOZIE configuration
################################################################################

# URL that points to the Oozie server where the Oozie REST API is available. It's
# just the hostname and port without the `/oozie/v1` postfix. This postfix is
# handled by the application itself
oozie.url = "https://localhost:11000"

# Job tracker URL that will be used as a pre-filled value when you submit an
# Oozie job (coordinator)
oozie.jobTracker = "localhost:8032"

Nomic DSL Guide

In this guide I would like to explain the main concepts of the Nomic DSL, and you will find information on how to write nomic.box descriptor files. We're using Groovy as the main language here, and we created a declarative DSL on top of it. The descriptors are a declarative way to tell the application what can be deployed and removed. The core of a descriptor file is a collection of items called facts. Each fact can be installed and reverted. The minimum descriptor script contains three required facts: the box group, the box name, and the version.

group = "app"
name = "some-box"
version = "1.0.0"

Variables in descriptor

In your descriptor scripts, you can also use some useful global variables:

  • group: the box group; can be any string. You can also structure groups with the / character.
  • name: the box name, which must be unique because it's used for identification
  • version: the box version
  • user: the username Nomic uses for installing/uploading files into HDFS, configured in nomic.conf
  • homeDir: each user in HDFS may have their own home directory. It's useful when you want to sandbox your applications/analyses.
  • appDir: path to the application directory in HDFS where applications are installed. The default value is ${homeDir}/app
  • nameNode: the hostname of the Name Node (the value of the fs.defaultFS parameter in your Hadoop configuration)

Each module (Hive, HDFS etc.) can also expose its own parameters:

  • hiveSchema: contains the default Hive schema configured via nomic.conf. Also useful if you want to sandbox your apps.
  • hiveJdbcUrl: the value of hive.jdbc.url in nomic.conf that is used by Hive facts.
  • hiveUser: the value of hive.user in nomic.conf that is used by Hive facts.
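To illustrate how these variables can be used together, here is a hypothetical descriptor sketch; the box coordinates and the file name are invented, and it relies only on the resource ... to construct described later in this guide:

```groovy
group = "examples"
name = "variables-demo"
version = "1.0.0"

// Upload a bundled file into a path built from the exposed homeDir variable.
// 'data.csv' and the target path are illustrative assumptions.
hdfs {
    resource 'data.csv' to "${homeDir}/shared/data.csv"
}
```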

Modules & dependencies

You can create an application box with multiple modules. This is especially useful for larger applications when you need to organize your content. There is also a second use case for modules: because facts inside a box know nothing about dependencies, you can solve dependency problems via modules as well.

Let's consider we've got an application 'best-analytics' with some resources and with this ./nomic.box:

group = "mycompany"
name = "best-analytics"
version = "1.0.0"

hdfs {
  ...
}

The box is built via the command:

$ jar cf best-analytics.nomic ./*

Let's imagine we would like to split the content into two modules, kpi and rfm. We create two new folders, each with its own nomic.box, that will represent our new modules.

The ./kpi/nomic.box:

group = "mycompany"
name = "kpi"
version = "1.0.0"

...

and the ./rfm/nomic.box:

group = "mycompany"
name = "rfm"
version = "1.0.0"

...

The final step is to declare these two new folders as modules in the main ./nomic.box:

group = "mycompany"
name = "best-analytics"
version = "1.0.0"

module 'kpi'
module 'rfm'

The module fact ensures the main application box has two new dependencies that will be installed before any resource in the main box. That means the installation installs each module first and then best-analytics. When we install this bundle, we should see three boxes listed:

$ ./bin/nomic install best-analytics.nomic
$ ./bin/nomic list
mycompany:best-analytics:1.0.0
mycompany:kpi:1.0.0
mycompany:rfm:1.0.0

Removing best-analytics will also remove all its modules in the right order.

Sometimes we also need to express that our rfm module depends on kpi. That can be achieved via the require fact. Let's modify our ./rfm/nomic.box:

group = "mycompany"
name = "rfm"
version = "1.0.0"

require name: "kpi", group: this.group, version: this.version

Now the rfm module requires kpi first, which means the kpi module will be installed first.

Factions

Maybe you noticed there is no way to set the order in which facts are executed. The solution is the faction. Factions are small blocks/groups of facts. Each faction has its own unique ID within the box and might depend on another faction.

Let's imagine you want to ensure the resources first and then create some Hive tables:

group = "mycompany"
name = "rfm"
version = "1.0.0"

faction ("resources") {
  resource 'file-1.csv'
  resource 'file-2.csv'
}

faction ("hivescripts", dependsOn = "resources") {
  table 'authors' from "create_authors_table.q"
}

Everything declared outside the faction blocks is considered a global fact and is executed first. The factions are executed after all these global facts.

group = "mycompany"
name = "rfm"
version = "1.0.0"

faction ("resources") {
  resource 'file-2.csv'
}

resource "file-1.csv"

In this example, the file-1.csv fact will be applied first even though it's declared after the faction.

Facts

Resource

The resource fact declares which resource from your box will be uploaded, and to where, in HDFS. Let's imagine we've got a box archive like:

/nomic.box
/some-file.xml

The descriptor below will install some-file.xml into the application's folder (depending on how it's configured).

group = "app"
name = "some-box"
version = "1.0.0"

hdfs {
    resource 'some-file.xml'
}

With a small modification you can place any resource at any path. E.g. the following example demonstrates how to place a file under the root /app:

hdfs {
    resource 'some-file.xml' to '/app/workflow.xml'
}

If the path does not start with the / character, the file will be placed into the working directory, which is basically ${appDir}.

hdfs {
    resource 'some-file.xml' to 'workflows/some-workflow.xml'
}

The example above ensures the file ${appDir}/workflows/some-workflow.xml exists, with the content of some-file.xml copied into it.

You can also redefine the default working directory:

hdfs("/path/to/app") {
    resource 'some-file.xml'
}

The example above will install some-file.xml into /path/to/app/some-file.xml.

As I mentioned, facts can be installed and uninstalled. In the resource case, uninstalling means the file will be removed. However, you can mark a file by setting the keepIt property to true, and uninstalling will then keep the file:

hdfs("/path/to/app") {
    resource 'some-file.xml' keepIt true
}
Dir

You can also declare the presence of a directory via the dir fact. The declaration will create a new empty directory if it is not present yet.

hdfs {
    dir "data"
}

Because the path does not start with the / character, the directory will be created in the current working directory. This declaration also covers uninstalling, which means the folder will be removed on uninstall or upgrade. If you wish to keep it, you can use the keepIt parameter:

hdfs {
    dir "data" keepIt true
}
Table

You can also declare facts for Hive in the descriptor. You can declare tables and schemas, and you can also ensure Hive script executions. Everything for Hive must be wrapped in a hive block.

The following example shows how to create a simple table in the default schema you have configured in nomic.conf:

group = "app"
name = "some-box"
version = "1.0.0"

hive {
    table 'authors' from "create_authors_table.q"
}

In your box, you need to have the Hive query file create_authors_table.q that will create the table if it's not present in the system:

CREATE EXTERNAL TABLE authors(
  NAME STRING,
  SURNAME STRING
)
STORED AS PARQUET
LOCATION '/data/authors';

In your Hive scripts you can use placeholders that will be replaced with values from the descriptor. The values are declared via fields. This is sometimes useful when you want to, for example, place a table into a particular schema.

hive {
    fields 'APP_DATA_DIR': "${appDir}/data", 'DATABASE_SCHEMA': defaultSchema
    table 'authors' from "create_authors_table.q"
}

The create_authors_table.q then uses these placeholders:

CREATE EXTERNAL TABLE ${DATABASE_SCHEMA}.authors(
  NAME STRING,
  SURNAME STRING
)
STORED AS PARQUET
LOCATION '${APP_DATA_DIR}/authors';
Schema

This fact creates a Hive schema during installation and drops it during the uninstall procedure. This fact is useful if you want to declare multiple schemas or if you don't want to rely on the default schema.

hive {
   schema 'my_schema'
}

As I mentioned, the example above will drop the schema during the uninstall process, which also happens during upgrading. If you want to prevent this, you can mark the schema with keepIt.

hive {
   schema 'my_schema' keepIt true
}

You can also declare a schema in the hive block itself. In this case, the schema will be used as the default schema across all facts inside that hive block. You might also have multiple blocks. The example below demonstrates more complex usage of schemas.

hive("${user}_${name}_staging") {
    table 'some_table' from 'some_script.q'
}

hive("${user}_${name}_processing") {
    fields 'DATABASE_SCHEMA': "${user}_${name}_processing"
    table 'some_table' from 'some_script.q'
}

hive("${user}_${name}_archive") {
    table 'some_table' from 'some_script.q'
}

This descriptor script will ensure three schemas, where each schema name is composed of the user name, the box name, and a postfix. As you can see, each section might have its own fields declaration.

Coordinator

The Nomic application also integrates with Oozie. You can declare an Oozie coordinator, which acts similarly to a resource but also submits the coordinator with parameters. This fact also ensures the coordinator will be stopped during removal.

Let's assume we've got a simple coordinator available as coordinator.xml in our box. In the descriptor file we declare:

group = "examples"
name = "oozieapp"
version = "1.0.0"

oozie {
    coordinator "coordinator.xml" parameters SOME_PARAMETER: "value 1", "another.parameter": "value 2"
}

This example copies the XML into the application folder in HDFS and submits a coordinator job with the given parameters such as SOME_PARAMETER, along with the following pre-filled parameters:

  • user.name: the user from the Nomic configuration (e.g. me)
  • nameNode: the name node URL (e.g. hdfs://server:8020)
  • jobTracker: the job tracker hostname from configuration, with port (e.g. server:8032)
  • oozie.coord.application.path: path to the coordinator XML in HDFS (e.g. /app/examples/oozieapp/coordinator.xml)