Hibari DB

A Distributed, Consistent, Ordered Key-Value Store

Hibari is a distributed, ordered key-value store with strong consistency guarantees. Hibari is written in Erlang and designed to be:

  • Fast, Read Optimized: Hibari serves read and write requests with short, predictable latency, and delivers excellent performance especially for reads and for operations on large values
  • High Bandwidth: Batch and lock-less operations help achieve high throughput while ensuring data consistency and durability
  • Big Data: Stores petabytes of data by automatically distributing it across servers. The largest production Hibari cluster spans over 100 servers
  • Reliable: Highly fault tolerant, with data replicated across servers and repaired automatically after a server failure

Hibari is able to deliver scalable high performance that is competitive with leading open source NOSQL (Not Only SQL) storage systems, while also providing the data durability and strong consistency that many systems lack. Hibari’s performance relative to other NOSQL systems is particularly strong for reads and for large value (> 200KB) operations.

As one example of real-world performance, in a multi-million user webmail deployment equipped with traditional HDDs (non SSDs), Hibari is processing about 2,200 transactions per second, with read latencies averaging between 1 and 20 milliseconds and write latencies averaging between 20 and 80 milliseconds.

Distinct Features

Unlike many other distributed databases, Hibari uses a “chain replication” methodology, which enables several distinctive features.

  • Ordered Key-Values: Data is distributed across “chains” by key prefixes, then keys within a chain are sorted by lexicographic order
  • Always Guarantees Strong Consistency: This simplifies creation of robust client applications
    • Compare and Swap (CAS): key timestamping mechanism that facilitates “test-and-set” type operations
    • Micro-Transaction: multi-key atomic transactions, within range limits
  • Custom Metadata: per-key custom metadata
  • TTL (Time To Live): per-key expiration times

Hibari’s Origins

Hibari was originally written by Cloudian, Inc., formerly Gemini Mobile Technologies, to support mobile messaging and email services. Hibari was open-sourced under the Apache License, version 2.0, in July 2010.

Hibari has been deployed by multiple telecom carriers in Asia and Europe. Because each telecom operator has its own data center support infrastructure, Hibari’s development has not included monitoring, event and alarm management, and other “production environment” support services that would be redundant in a carrier environment.

We hope that Hibari’s release to the open source community will close those functional gaps as Hibari spreads outside of carrier data centers.

Tip

What does Hibari mean? The word “Hibari” means skylark in Japanese; the Kanji characters stand for “cloud bird”.

A Quick Tour

TODO

User Documentation

Hibari Application Developer’s Guide (Hibari v0.1.11)

Date: 2015/03/22
Revision: 0.5.4

Copyright (C) 2005-2015 Hibari developers. All rights reserved.

Table of Contents

Introduction

Hibari is a production-ready, distributed, key-value, big data store. In the emerging field of “NOSQL” solutions to today’s mass-scale data storage challenges, Hibari stands out for several reasons:

  • Hibari is the only open source key-value database to couple Erlang engineering with innovative chain replication technology. Erlang is the ideal programming foundation on which to build a robust, high-performance distributed storage solution. Chain replication delivers high throughput and availability without sacrificing data consistency.
  • Hibari is the only open source key-value database built to the exacting standards of the carrier-class telecom sector, and proven in multi-million user telecom production environments.
  • Hibari delivers a distinctive feature matrix that includes:
    • Per-table options for RAM+disk-based or disk-only value storage
    • Support for per-key expiration times and per-key custom meta-data
    • Support for multi-key atomic transactions, within range limits
    • A key timestamping mechanism that facilitates “test-and-set” type operations
    • Automatic data rebalancing as the system scales
    • Support for live code upgrades
    • Multiple client API implementations

This introductory chapter will briefly address the recent emergence of NOSQL solutions to the challenges posed by the “Big Data” era before turning to describe more fully the distinctive benefits that Hibari provides to developers, administrators, and users of data-intensive applications.

Why NOSQL?

The NOSQL “movement” is, first off, not an outright rejection of traditional relational database management systems (RDBMS) but rather a growing recognition that today’s data environment requires a diverse storage toolset that is “Not Only SQL (NOSQL)”. Relational and NOSQL data storage solutions should be viewed as complements, with each approach better suited toward different types of applications and services.

The main driver of NOSQL has been the proliferation of applications and services that must store and serve terabytes or petabytes of data, often while striving to guarantee “always-on” availability and low latencies for end users. Organizations in many market sectors are grappling with the advent of Big Data, including but not limited to:

  • Web properties – coping with the massive data requirements of search, e-commerce, social media, and user-generated content.
  • Telecoms – managing and analyzing network logs and call data records for many millions of subscribers.
  • Utilities – managing and analyzing the enormous data volume associated with smart grids.
  • Financial services – storing and mining customer history data in order to analyze and model risk.
  • Retail analytics – click-stream analysis and micro-targeting.
  • Biotech – genome analysis.

Organizations in these and other data-intensive environments have been challenged to build data storage systems of unprecedented scale. Many such organizations have found their needs ill-met by traditional data storage approaches that center around relational database management systems and specialized high-end hardware. In particular:

  • Scaling up a single RDBMS instance doesn’t achieve nearly the scale required, no matter how high-end the systems or how great the expenditure.
  • Scaling out by sharding the system over multiple RDBMS instances entails enormous costs and enormous operational complexity, while at the same time forfeiting much of the power of the relational model.

Wanting Big Data capacity without crippling cost and complexity, some innovative organizations have sought a better way to scale. At the same time, with an ever-expanding array of data usage scenarios, it’s become apparent that not all scenarios require the complex querying and management functionality associated with an RDBMS. For some applications and services, SQL-structuring and strict ACID properties are overkill. Worse, in some environments they’re expensive overkill that can potentially hamstring service offerings in highly competitive markets that demand flexibility and responsiveness.

In short, recent years have seen a proliferation of services that require more data, with less structure.

Not surprisingly, some of the leading web enterprises have been at the forefront of the NOSQL movement. In particular, Google with its http://labs.google.com/papers/bigtable.html[BigTable paper] in 2006 and Amazon with its http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf[Dynamo paper] in 2007 had a profound effect on the NOSQL market. A number of NOSQL solutions have drawn inspiration from BigTable or Dynamo or both, and in the past couple of years several solutions have been released into the open source community.

While NOSQL data storage solutions vary in their particulars, they have these basic traits in common:

  • A simplified data model. Data models vary across specific solutions, and sometimes form the basis of a tripartite classification of NOSQL systems into 1) key-value data stores (such as Dynamo and Hibari); 2) column-oriented data stores (such as BigTable); and 3) document-oriented data stores (such as CouchDB). All variants, however, are simpler and more flexible in data model than the traditional RDBMS. That simplification tends to carry over to client APIs as well.
  • Distribution across multiple nodes based on commodity PCs. Affordable Big Data capacity is achieved by scaling out across tens, hundreds, or even thousands of commodity PCs. Data partitioning schemes coupled with parallel processing of incoming requests deliver the needed high performance.
  • Replication of data objects across multiple nodes, to ensure high availability in the event of component failures.

For much more on the history, merits, and design issues associated with NOSQL storage solutions, consult with your favorite search engine.

Why Hibari?

Hibari was developed internally by Cloudian, Inc. (formerly Gemini Mobile Technologies), a leading producer of mass-scale messaging and transaction systems for Tier 1 mobile operators in Asia, Europe, and the Americas. Cloudian had need for a data store that was efficient, fast, flexible, and scalable, as well as robust enough to withstand the rigors of deployment in Tier 1 telecom production environments. Dissatisfied with the then-available options, Cloudian in 2005 began work on what came to be Hibari (the name is Japanese for skylark; the kanji characters stand for “cloud bird”).

With the system having in recent years matured and been proven in production, Cloudian released Hibari to the open source community in July 2010 under the Apache 2.0 license. Cloudian regards the open source community as the best venue in which Hibari can continue to perfect and grow.

This section describes some of the distinctive features that make Hibari a very attractive option for businesses and developers seeking a modern Big Data storage system:

  • link:#engineered-erlang[Engineered in Erlang]
  • link:#chain-replication[Chain Replication for High Availability and Strong Consistency]
  • link:#scalability[Easy, Affordable Scalability]
  • link:#high-performance[High Performance, Especially for Reads and Large Values]
  • link:#simple-powerful-api[Simple But Powerful Client API]
  • link:#production-proven[Production-Proven]
  • link:#hibari-benefits-by-user[Hibari Benefits for Developers, System Administrators, and Businesses]

[[engineered-erlang]]

Engineered in Erlang

Erlang is a general purpose programming language and runtime environment designed specifically to support reliable, high-performance distributed systems. Originally developed by Ericsson in the 1980s for building advanced telecom networking systems, Erlang/OTP (Open Telecom Platform) was open-sourced in 1998. Hibari is written entirely in Erlang.

Erlang provides a range of benefits that make it the ideal foundation for a distributed key-value storage solution:

  • Concurrency. Erlang has extremely lightweight processes that communicate by message passing and have no shared memory. Scheduling, memory management, and other concurrency-related services are managed by the Erlang VM, placing no requirements for concurrency on the host operating system.
  • Distribution. Erlang is designed specifically for distributed environments. Passing messages transparently via TCP, Erlang processes on different nodes communicate with each other in exactly the same way as do processes on the same node. The simple and efficient design facilitates massive parallelism and scalability of the sort required by a high-performance distributed storage system. With its prowess for concurrency and distributed processing, it has been suggested that Erlang can be regarded as a first-of-its-kind http://www.oreillygmt.eu/open-sourcefree-software/erlang-the-ceos-view/[“application system”], analogous to an operating system except running across and coordinating multiple hosts.
  • Robustness. Erlang processes are completely independent of each other, with no data sharing. While functionally isolated, Erlang processes are able to monitor each other and to detect and respond to crashed processes, even on remote nodes.
  • Portability. The same Erlang VM can run on Linux, Unix, Windows, Macintosh, or VxWorks. Distributed Erlang processes can seamlessly communicate with each other regardless of the heterogeneity of their host operating systems. This OS portability is a valuable facilitator of storage system elasticity, as system managers may need to mix and match hosts in response to fluid demand environments.
  • Hot code upgrades. Erlang-based applications like Hibari support hot code upgrades: upgrades can be applied without shutting down the system. During the change-over, old and new code can run simultaneously. This is a key benefit for environments that require “always-on” availability for end users.

Other features reinforce Erlang’s suitability for reliable distributed applications, including incremental garbage collection, single-assignment variables, and robust exception handling.

[[chain-replication]]

Chain Replication for High Availability and Strong Consistency

The Hibari distributed key-value store implements a version of the chain replication methodology first proposed by http://www.usenix.org/event/osdi04/tech/full_papers/renesse/renesse.pdf[van Renesse and Schneider] to achieve redundancy and high availability without sacrificing data consistency. At a high level, chain replication in a Hibari storage cluster works as follows:

  • Through consistent hashing, the key space is divided across multiple storage “chains”.
  • Each chain is composed of multiple logical storage “bricks”, with each brick running in its own Erlang VM instance.
  • Within each chain, the member bricks have differentiated roles. Client-requested updates to key-value pairs are written first to the “head” brick, then automatically replicated downstream to one or more “middle” bricks and finally to the “tail” brick, which returns an update acknowledgement to the client. By contrast, read requests are directed to the tail brick, which returns the response to the client.

image:images/chain_replication.png[]

While most distributed storage systems are able to guarantee only weak or eventual data consistency across replicas – placing the burden on the client application (and the client application developer) to manage the potential inconsistencies – Hibari with its chain replication implementation guarantees strong consistency. Data updates are considered complete, and are acknowledged to clients, only when they have replicated through the chain to the tail; and read requests are processed only by the tail. Consequently, after an object update is acknowledged to a Hibari client, other clients are guaranteed to see only the newest version of that object. This strong consistency is valuable in environments where ‘eventual consistency’ is at odds with the service level expected by end users, or where system designers do not want to clutter client applications with the logic required to manage data inconsistency.

The “length” of a chain is configurable and can be based on your desired degree of replication and redundancy. For example, a chain of length four would have a head brick, two middle bricks, and a tail brick; while a three-brick chain would have a head, one middle, and a tail. A chain can also operate at length two (a head and tail, with no middle) and even at length one (one brick playing both the head role and the tail role).

Because chains can operate at any length, and because the system is able to detect failures within the chain and to adjust member brick roles accordingly, Hibari delivers high availability as well as strong data consistency. For example, if in a three-brick chain the head brick goes down, the middle brick automatically takes over the head brick role, allowing the chain to continue functioning normally:

image:images/automatic_failover.png[]

If the new head brick failed also, the lone remaining brick would then play both the head role and the tail role, processing all writes and reads itself as a single-brick “chain”.

While multiple logical bricks can run on a single physical node, for high availability it is of course desirable that a particular chain’s member bricks be deployed on separate machines. If you want to run multiple bricks per machine and also ensure high availability for each chain, an attractive deployment option is to “stripe” the chains across machines:

image:images/load_balanced_chains.png[]

Note also that because head bricks (receiving incoming write requests) and tail bricks (replying to write requests and processing read requests) bear more load than do middle bricks, load balancing across machines can be achieved in part by allocating the different brick roles evenly, as in the diagram above.

In the event of a physical node failure, bricks within each impacted chain automatically shift roles, and each chain continues to provide normal service to clients:

image:images/automatic_failover_2.png[]

For further information about chain replication, fail-over, and recovery in a Hibari storage system, and for information about Hibari’s redundantly structured cluster membership application called the Admin Server, see these sections of the Hibari System Administrator’s Guide:

  • link:hibari-sysadmin-guide.en.html#hibari-architecture[Hibari Architecture]
  • link:hibari-sysadmin-guide.en.html#life-of-brick[The Life of a (Logical) Brick]
  • link:hibari-sysadmin-guide.en.html#dynamic-cluster-reconfiguration[Dynamic Cluster Reconfiguration]
  • link:hibari-sysadmin-guide.en.html#admin-server-app[The Admin Server Application]

[[scalability]]

Easy, Affordable Scalability

Hibari provides Big Data scalability while minimizing the cost and operational complexity of cluster growth:

  • Hibari scales horizontally by the addition of more chains, deployed on more physical nodes. The total storage and processing capacity of a Hibari cluster increases linearly as machines are added to the cluster.
  • The system rebalances data storage distribution automatically as chains are added to (or removed from) the cluster, with no downtime. You can grow (or shrink) your Hibari storage cluster with no service interruption.
  • Hibari runs on commodity PCs. Further, the system easily accommodates heterogeneous hardware resources. Bricks within the storage cluster can have different RAM and disk sizes, and different CPU speeds. You can tune Hibari’s consistent hash function to optimize your cluster’s utilization of mixed hardware. Each chain can be assigned a weighting factor that will increase or decrease that chain’s portion of the overall key space, relative to other chains.

In addition to supporting mixed hardware, Erlang-based Hibari can run on most any OS. With its easy adaptability to disparate hardware and operating systems, you can scale Hibari incrementally, with whatever resources you have available. It’s not necessary to buy all your resources at once, or all of the same kind.

Note

The outer limits of Hibari’s horizontal scalability have not been definitively determined, but 200 to 250 nodes is a practical boundary due to the limitations of Erlang’s built-in network distribution implementation. Also, while Hibari chains could theoretically be stretched across multiple data centers to provide geographic redundancy, to date only single-data-center deployments have been tested and used in production.

For further information on resizing a Hibari cluster, see link:hibari-sysadmin-guide.en.html#dynamic-cluster-reconfiguration[Dynamic Cluster Reconfiguration] in the Hibari System Administrator’s Guide.

[[high-performance]]

High Performance, Especially for Reads and Large Values

Several features work in combination to drive high performance in a Hibari storage cluster, even at Big Data scale:

  • The Erlang technology that underlies Hibari was specifically designed for and excels at distributed parallel processing.
  • Hibari’s implementation of consistent hashing and chain replication partitions the key-space across multiple chains, enabling parallel simultaneous processing of requests incoming to individual chains. The distribution of data across chains is tunable to allow optimal utilization of heterogeneous hardware resources.
  • Hibari’s chain replication implementation further aids performance by assigning storage bricks differentiated processing roles as head, middle, or tail. This division of labor particularly benefits read performance, as read requests are processed by “tail” bricks that do not bear the load of initial processing of write requests (since that work is done by “head” bricks).
  • Hibari supports a number of performance-tuning options on a per-table basis. For example, while some distributed KVDBs support only disk-based storage or only RAM-based storage of value blobs, Hibari lets you choose RAM+disk-based or disk-only storage on a per-table basis, depending on your application needs. Whichever storage option you choose, all data changes are logged to disk to ensure data durability in the event of power failures. A batch commit technique is used to minimize disk I/O.

Leveraging this feature set, Hibari is able to deliver scalable high performance that is competitive with leading open source NOSQL storage systems, while also providing the data durability and strong consistency that many systems lack. Hibari’s performance relative to other NOSQL systems is particularly strong for reads and for large value (> 200KB) operations. Hibari’s consistently high performance even for large values distinguishes the system from solutions that are tailored toward small value operations.

As one example of real-world performance, in a multi-million user webmail deployment equipped with traditional HDDs (non SSDs), Hibari is processing about 2,200 transactions per second, with read latencies averaging between 1 and 20 milliseconds and write latencies averaging between 20 and 80 milliseconds.

[[simple-powerful-api]]

Simple But Powerful Client API

As a key-value store, Hibari’s core data model and client API model are simple by design: blob-based key-value pairs can be inserted, retrieved, and deleted from lexicographically sorted tables. While Hibari thus provides the flexibility and scalability associated with key-value stores, the system also provides distinctive features that enhance the power of client applications and developers:

  • Clients can optionally assign per-object expiration times.
  • Clients can optionally assign per-object custom flags. This flexible, custom meta-data can be updated with or without updating the associated value blob, and can be retrieved with or without the value blob (see the sketch following this list).
  • Objects are automatically timestamped each time they are updated. This timestamping mechanism facilitates “test-and-set” type operations: clients can specify that a requested operation be performed only if the target key’s timestamp matches the client’s expectations.
  • Within key-prefix range limits (specifically, within individual chains but not across chains), Hibari’s client API supports atomic transactions. This support for “micro-transactions” sets Hibari apart from other open source KVDBs and can greatly simplify the creation of robust client applications.
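
For example, using the native Erlang API described later in this guide, a single call can store a value together with an absolute expiration time and an application-defined flag. The following is only a sketch: tab1 is the default table created at bootstrap, 1500000000 is an arbitrary future Unix time_t, and {author, <<"worker-7">>} is a hypothetical custom flag:

> brick_simple:set(tab1, <<"user/1001/profile">>, <<"opaque value blob">>,
                   1500000000, [{author, <<"worker-7">>}], 15000).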

Hibari supports multiple client API implementations including:

  • Native Erlang
  • Universal Binary Format (UBF)
  • Thrift
  • Amazon S3
  • JSON-RPC

You can develop Hibari client applications in a variety of languages including Java, C/C++, Python, Ruby, and Erlang.

For further information about Hibari’s client API, see link:#client-api-erlang[Client API: Native Erlang] and the subsequent client API chapters in this guide.

Note

The Hibari source distribution does not include Amazon S3 and JSON-RPC. They are separate external projects.

[[production-proven]]

Production-Proven

While initial development work on Hibari was geared generally toward the data storage demands of the Tier 1 telecom sector, as the system evolved it needed to meet the requirements of a particular major Asian carrier that wished to launch a GB webmail service. This customer’s requirements for Hibari included the following:

  • Several million users from the start.
  • Several billion stored messages within a few months of launch.
  • Hundreds of TB storage capacity.
  • Elasticity to support continual growth.
  • Low system costs, particularly since the service would employ the “freemium” model.
  • Individual messages could range in size from a few bytes to many MB with attachments.
  • Support for per-object meta-data required.
  • Strong consistency required, for interactive sessions.
  • Data durability required – loss of messages or meta-data unacceptable.
  • High availability – an “always on”, branded service.
  • Low latency, with < 1 second response times for end user transactions.

Hibari was built to meet these rigorous requirements, was hardened through extensive testing and trials, and went live in support of this large-scale webmail system at the beginning of 2010. The system now stores billions of messages on behalf of millions of end users, while meeting customer requirements for availability, latency, consistency, durability, and affordability.

Coinciding with Hibari’s development and fine tuning for this GB webmail service, the system was also deployed as a storage solution for two major Asian carriers’ mobile social networking services. In this context, Hibari stores user profile data as well as digital goods of varying types and sizes.

[[hibari-benefits-by-user]]

Hibari Benefits for Developers, System Administrators, and Businesses

For application developers, Hibari offers a distinctive set of benefits not often found in distributed key-value stores:

  • Strong data consistency guarantees that relieve client applications of the burden of managing potential inconsistencies.
  • Micro-transaction support that simplifies the creation of powerful applications.
  • Per-object custom flags that facilitate flexible, service-specific application design.
  • Support for a variety of API implementations and development languages.

For system administrators, Hibari provides valuable operational automations that simplify data management in a dynamic storage environment:

  • Automatic data replication.
  • Automatic failover when a node goes down.
  • Automatic repair when a failed node comes back up.
  • Automatic rebalancing of data as a cluster grows or shrinks.

For businesses as a whole, Hibari offers affordable Big Data scalability while delivering the high availability and low latencies that service users demand. Hibari is an appropriate storage solution for a range of data-intensive service scenarios including but not limited to large-scale messaging, social media, and archiving. Hibari offers particular value in environments that require strong data consistency and/or high performance across a variety of object types and sizes.

Getting Started

This section covers the following topics to help you get up and running with Hibari:

  • link:#system-requirements[System Requirements]
  • link:#required-software[Required Third Party Software]
  • link:#download-hibari[Downloading Hibari]
  • link:#installing-single-node[Installing a Single-Node Hibari System]
  • link:#starting-single-node[Starting and Stopping a Single-Node Hibari System]
  • link:#installing-multi-node[Installing a Multi-Node Hibari Cluster]
  • link:#starting-multi-node[Starting and Stopping a Multi-Node Hibari Cluster]
  • link:#creating-tables[Creating New Tables]

[[system-requirements]]

System Requirements

Hibari will run on any OS that the Erlang VM supports, which includes most Unix and Unix-like systems, Windows, and Mac OS X. See Implementation and Ports of Erlang from the official Erlang documentation for further information.

For guidance on hardware requirements in a production environment, see link:hibari-sysadmin-guide.en.html#brick-hardware[Notes on Brick Hardware] in the Hibari System Administrator’s Guide.

[[required-software]]

Required Third-Party Software

Hibari’s requirements for third party software depend on whether you’re doing a single-node installation or a multi-node installation.

Required Software for a Single-Node Installation:

The node on which you plan to install Hibari must have the following software:

Required Software for a Multi-Node Installation:

When you install Hibari on multiple nodes you will use an installer tool that simplifies the cluster set-up process. When you use this tool you will identify the hosts on which you want Hibari to be installed, and the tool will manage the installation of Hibari onto those target hosts. You can run the tool itself from one of your target Hibari nodes or from a different machine. There are distinct requirements for third party software on the “installer node” (the machine from which you run the installer tool) and on the Hibari nodes (the machines on which Hibari will be installed and run.)

Installer Node Required Software

The installer node must have the software listed below. If you are missing any of these items, you can use the provided links for downloads and installation instructions.

There are currently no known version requirements for Bash, Expect, Perl, or SSH.

Hibari Nodes Required Software

The nodes on which you plan to install Hibari must have the software listed below.

[[download-hibari]]

Downloading Hibari

Hibari is not yet available as a pre-built release. In the meantime, you can build Hibari from source. Follow the instructions in <<HibariBuildingSource>>, and then return to this section to continue the set-up process.

When you build Hibari, the output is two files that you will use later in the set-up process:

  • A tarball package hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
  • An md5sum file hibari-X.Y.Z-DIST-ARCH-WORDSIZE-md5sum.txt

X.Y.Z is the release version, DIST is the release distribution, ARCH is the release architecture, and WORDSIZE is the release wordsize.

[[installing-single-node]]

Installing a Single-Node Hibari System

A single-node Hibari system will not provide data replication and redundancy in the way that a multi-node Hibari cluster will. However, you may wish to deploy a simple single-node Hibari system for testing and development purposes.

  1. Create a directory for running Hibari:

    $ mkdir running-directory
    
  2. Untar the Hibari tarball package that you created when you built Hibari from source:

    $ tar -C running-directory -xvf hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
    

Important

On your Hibari node, in the system’s /etc/sysctl.conf file, set vm.swappiness=1. Swappiness is not desirable for an Erlang VM.

[[starting-single-node]]

Starting and Stopping Hibari on a Single Node
Starting and Bootstrapping Hibari
  1. Start Hibari:

    $ running-directory/hibari/bin/hibari start
    
  2. If this is the first time you’ve started Hibari, bootstrap the system:

    $ running-directory/hibari/bin/hibari-admin bootstrap
    

The Hibari bootstrap process starts Hibari’s Admin Server on the single node and creates a single table “tab1” serving as Hibari’s default table. For information on creating additional tables, see link:#creating-tables[Creating New Tables].

Verifying Hibari

Do these quick checks to verify that your single-node Hibari system is up and running.

  1. Confirm that you can open the “Hibari Web Administration” page:

    $ your-favorite-browser http://127.0.0.1:23080
    
  2. Confirm that you can successfully ping the Hibari node:

    $ running-directory/hibari/bin/hibari ping
    

IMPORTANT: A single-node Hibari system is hard-coded to listen on the localhost address 127.0.0.1. Consequently the Hibari node is reachable only from the node itself.

Stopping Hibari

To stop Hibari:

$ running-directory/hibari/bin/hibari stop

[[installing-multi-node]]

Installing a Multi-Node Hibari Cluster

Before you install Hibari on to the target nodes you must complete these preparation steps:

  • Set up required user privileges on the installer node and on the target Hibari nodes.
  • Download the Cluster installer tool.
  • Configure the Cluster installer tool.

Setting Up Your User Privileges

The system user ID that you use to perform the installation must be different from the Hibari runtime user. Your installing user account ($USER) must be set up as follows:

  • $USER must exist on the installer node and also on the target Hibari nodes.
  • $USER on the installer node must have SSH private/public keys, with the SSH agent set up to enable password-less SSH login.
  • $USER account must be accessible with password-less SSH login on the target Hibari nodes.
  • $USER must have password-less sudo access on the target Hibari nodes.

If your installing user account does not currently have the above privileges, follow these steps:

  1. As the root user, add your installing user ($USER) to the installer node. Then on each of the Hibari nodes, add your installing user and grant your user password-less sudo access:

    $ useradd $USER
    $ passwd $USER
    $ visudo
    # append the following line and save it
    $USER  ALL=(ALL)       NOPASSWD: ALL
    

Note

If you get a “sudo: sorry, you must have a tty to run sudo” error while testing sudo, try commenting out the following line in the /etc/sudoers file:

$ visudo
Defaults    requiretty
  2. On the installer node, create a new SSH private/public key for your installing user:

    $ ssh-keygen
    # enter your password for the private key
    $ eval `ssh-agent`
    $ ssh-add ~/.ssh/id_rsa
    # re-enter your password for the private key
    
  3. On each of the Hibari nodes:

  • Append an entry for the installer node to the ~/.ssh/known_hosts file.
  • Append an entry for your public SSH key to the ~/.ssh/authorized_keys file.

In the example below, the target Hibari nodes are dev1, dev2, and dev3:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@dev1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@dev2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@dev3

Note

If your installer node will be one of the Hibari cluster nodes, make sure that you ssh-copy-id to the installer node also.

  4. Confirm that password-less SSH access to each of the Hibari nodes works as expected:

    $ ssh $USER@dev1
    $ ssh $USER@dev2
    $ ssh $USER@dev3
    

Tip

If you need more help with SSH set-up, check http://inside.mines.edu/~gmurray/HowTo/sshNotes.html.

[[download-cluster]]

Downloading the Cluster Installer Tool

“Cluster” is a simple tool for installing, configuring, and bootstrapping a cluster of Hibari nodes. The tool is not part of the Hibari package itself, but is available from GitHub.

Note

The Cluster tool should meet the needs of most users. However, this tool’s “target node” recipe is currently Linux-centric (e.g. useradd, userdel, ...). Patches and contributions for other OSes and platforms are welcome. For non-Linux deployments, the Cluster tool’s recipe is simple enough that the installation can be performed manually by following it.

  1. Create a working directory into which you will download the Cluster installer tool:

    $ mkdir working-directory
    
  2. Download the Cluster tool’s Git repository from GitHub:

    $ cd working-directory
    $ git clone git://github.com/hibari/clus.git
    

The download creates a sub-directory clus under which the installer tool and various supporting files are stored.

[[config-cluster]]

Configuring the Cluster Installer Tool

The Cluster tool requires some basic configuration information that indicates how you want your Hibari cluster to be set up. You will create a simple text file that specifies your desired configuration, and then later use the file as input when you run the Cluster tool.

It’s simplest to create the file in the same working directory in which you downloaded the cluster tool. You can give the file any name that you want; for purposes of these instructions we will use the file name hibari.config.

Below is a sample hibari.config file. The file that you create must include all of these parameters, and the values must be formatted in the same way as in this example (with parentheses and quotation marks as shown). Parameter descriptions follow the example file.

ADMIN_NODES=(dev1 dev2 dev3)
BRICK_NODES=(dev1 dev2 dev3)
BRICKS_PER_CHAIN=2

ALL_NODES=(dev1 dev2 dev3)
ALL_NETA_ADDRS=("10.181.165.230" "10.181.165.231" "10.181.165.232")
ALL_NETB_ADDRS=("10.181.165.230" "10.181.165.231" "10.181.165.232")
ALL_NETA_BCAST="10.181.165.255"
ALL_NETB_BCAST="10.181.165.255"
ALL_NETA_TIEBREAKER="10.181.165.1"

ALL_HEART_UDP_PORT="63099"
ALL_HEART_XMIT_UDP_PORT="63100"

[[eligible-admin-nodes]]

  • ADMIN_NODES
    • Host names of the nodes that will be eligible to run the Hibari Admin Server. For complete information on the Admin Server, see link:hibari-sysadmin-guide.en.html#admin-server-app[The Admin Server Application] in the Hibari System Administrator’s Guide.
  • BRICK_NODES
    • Host names of the nodes that will serve as Hibari storage bricks. Note that in the sample configuration file above there are three storage brick nodes (dev1, dev2, and dev3), and these three nodes are each eligible to run the Admin Server.
  • BRICKS_PER_CHAIN
    • Number of bricks per replication chain. For example, with two bricks per chain there will be two copies of the data stored in the chain (one copy on each brick); with three bricks per chain there will be three copies, and so on. For an overview of chain replication, see link:#chain-replication[Chain Replication for High Availability and Strong Consistency] in this document. For chain replication detail, see the Hibari System Administrator’s Guide.
  • ALL_NODES
    • This list of all Hibari nodes is the union of ADMIN_NODES and BRICK_NODES.
  • ALL_NETA_ADDRS
    • As described in link:hibari-sysadmin-guide.en.html#partition-detector[The Partition Detector Application] in the Hibari System Administrator’s guide, the nodes in a multi-node Hibari cluster should be connected by two networks, Network A and Network B, in order to detect and manage network partitions. The ALL_NETA_ADDRS parameter specifies the IP addresses of each Hibari node within Network A, which is the network through which data replication and other Erlang communications will take place. The list of the IP addresses should correspond in order to host names you listed in the ALL_NODES setting.
  • ALL_NETB_ADDRS
    • IP addresses of each Hibari node within Network B. Network B is used only for heartbeat broadcasts that help to detect network partitions. The list of the IP addresses should correspond in order to host names you listed in the ALL_NODES setting.
  • ALL_NETA_BCAST
    • IP broadcast address for Network A.
  • ALL_NETB_BCAST
    • IP broadcast address for Network B.
  • ALL_NETA_TIEBREAKER
    • Within Network A, the IP address for the network monitoring application to use as a “tiebreaker” in the event of a partition. If the network monitoring application on a Hibari node determines that Network A is partitioned and Network B is not partitioned, then if the Network A tiebreaker IP address responds to a ping, then the local node is on the “correct” side of the partition. Ideally the tiebreaker should be the address of the Layer 2 switch or Layer 3 router that all Erlang network distribution communications flow through.
  • ALL_HEART_UDP_PORT
    • UDP port for heartbeat listener.
  • ALL_HEART_XMIT_UDP_PORT
    • UDP port for heartbeat transmitter.

For more detail on network monitoring configuration settings, see the partition-detector’s OTP application source file (https://github.com/hibari/partition-detector/raw/master/src/partition_detector.app.src).

CAUTION: In a production setting, Network A and Network B should be physically different networks and network interfaces. However, for testing and development purposes the same physical network can be used for Network A and Network B (as in the sample configuration file above).

As final configuration steps, on each Hibari node:

  • Make sure that the /etc/hosts file has entries for all Hibari nodes in the cluster. For example:

    10.181.165.230  dev1.your-domain.com    dev1
    10.181.165.231  dev2.your-domain.com    dev2
    10.181.165.232  dev3.your-domain.com    dev3
    
  • In the system’s /etc/sysctl.conf file, set vm.swappiness=1. Swappiness is not desirable for an Erlang VM.

Installing Hibari

From your installer node, logged in as the installer user, take these steps to create your Hibari cluster:

  1. In the working directory in which you link:#download-cluster[downloaded the Cluster tool] and link:#config-cluster[created your cluster configuration file], place a copy of the Hibari tarball package and md5sum file:

    $ cd working-directory
    $ ls -1
    clus
    hibari-X.Y.Z-DIST-ARCH-WORDSIZE-md5sum.txt
    hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
    hibari.config
    $
    
  2. Create the “hibari” user on all Hibari nodes:

    $ for i in dev1 dev2 dev3 ; do ./clus/priv/clus.sh -f init hibari $i ; done
    hibari@dev1
    hibari@dev2
    hibari@dev3
    

Note

If the “hibari” user already exists on the target nodes, the -f option will forcefully delete and then re-create the “hibari” user.

  3. Install the Hibari package on all Hibari nodes, via the newly created “hibari” user:

    $ ./clus/priv/clus-hibari.sh -f init hibari hibari.config hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
    hibari@dev1
    hibari@dev2
    hibari@dev3
    

Note

By default the Cluster tool installs Hibari into /usr/local/var/lib on the target nodes. If you prefer a different location, before doing the install open the clus.sh script (in your working directory, under clus/priv/) and edit the CT_HOMEBASEDIR variable.

[[starting-multi-node]]

Starting and Stopping a Multi-Node Hibari Cluster

You can use the Cluster installer tool to start and stop your multi-node Hibari cluster, working from the same node from which you managed the installation process. Note that in each of the Hibari commands in this section you’ll be referencing the name of the link:#config-cluster[Cluster tool configuration file] that you created during the installation procedure.

Starting and Bootstrapping the Hibari Cluster
  1. Change to the working directory in which you downloaded the Cluster tool, then start Hibari on all Hibari nodes via the “hibari” user:

    $ cd working-directory
    $ ./clus/priv/clus-hibari.sh -f start hibari hibari.config
    hibari@dev1
    hibari@dev2
    hibari@dev3
    
  2. If this is the first time you’ve started Hibari, bootstrap the system via the “hibari” user:

    $ ./clus/priv/clus-hibari.sh -f bootstrap hibari hibari.config
    hibari@dev1 => hibari@dev1 hibari@dev2 hibari@dev3
    

The Hibari bootstrap process starts Hibari’s Admin Server on the first link:#eligible-admin-nodes[eligible admin node] and creates a single table “tab1” serving as Hibari’s default table. For information about creating additional tables, see link:#creating-tables[Creating New Tables].

Note

If bootstrapping fails with an “another_admin_server_running” error, stop the other Hibari cluster(s) running on the network, or reconfigure the Cluster tool to assign link:#eligible-admin-nodes[Hibari heartbeat listener ports] (the ALL_HEART_UDP_PORT and ALL_HEART_XMIT_UDP_PORT settings) that are not in use by another Hibari cluster or other applications, and then repeat the cluster installation procedure.

Verifying the Hibari Cluster

Do these simple checks to verify that Hibari is up and running.

  1. Confirm that you can open the “Hibari Web Administration” page:

    $ your-favorite-browser http://dev1:23080
    
  2. Confirm that you can successfully ping each of your Hibari nodes:

    $ ./clus/priv/clus-hibari.sh -f ping hibari hibari.config
    hibari@dev1 ... pong
    hibari@dev2 ... pong
    hibari@dev3 ... pong
    
Stopping the Hibari Cluster

Stop Hibari on all Hibari nodes via the “hibari” user:

$ cd working-directory
$ ./clus/priv/clus-hibari.sh -f stop hibari hibari.config
ok
ok
ok
hibari@dev1
hibari@dev2
hibari@dev3

[[creating-tables]]

Creating New Tables

The simplest way to create a new table is via the Admin Server’s GUI. Open http://localhost:23080/ and click the “Add a table” link. In addition to the GUI, the hibari-admin tool can also be used to create a new table. See the hibari-admin tool for usage details.

Note

For information about creating tables using the administrative API, see the Hibari System Administrator’s Guide.

When adding a table through the GUI, you have these table configuration options:

  • Local
    • Boolean. If true, all bricks for storing the new table’s data will be created on the local node, i.e. the node that’s running the Admin Server. If false, then the “NodeList” field is used to specify which cluster nodes the new bricks should use.
  • BigData
    • Boolean. If true, value blobs will be stored on disk.
  • DiskLogging
    • Boolean. If true, all updates will be written to the write-ahead log for persistence. If false, bricks will run faster but at the expense of data loss in a cluster-wide power failure.
  • SyncWrites
    • Boolean. If true, all writes to the write-ahead log will be flushed to stable storage via the fsync(2) system call. If false, bricks will run faster but at the expense of data loss in a cluster-wide power failure.
  • VarPrefix
    • Boolean. If true, then a variable-length prefix of the key will be used as input for the consistent hashing function. If false, the entire key will be used.

Many applications can benefit from using a variable-length or fixed-length prefix hashing scheme. As an example, consider an application that maintains state for various users. The app wishes to use micro-transactions to update various keys (in the same table) related to that user. The table can be created to use VarPrefix=true, together with VarPrefixSeparator=47 (ASCII 47 is the forward slash character) and VarPrefixNumSeparators=2, to create a hashing scheme that will guarantee that keys /FooUser/summary and /FooUser/thing1 and /FooUser/thing9 are all stored by the same chain.
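
To make the prefix rule concrete, here is a small standalone Erlang sketch (an illustration only, not Hibari's internal implementation) of deriving such a variable-length prefix from a key, where Sep is the separator byte (ASCII 47 above) and N is the number of separators to consume:

%% var_prefix(Key, Sep, N) returns the leading part of Key up to and
%% including the Nth occurrence of the separator byte; the whole key is
%% returned if it contains fewer than N separators.
%%
%%   var_prefix(<<"/FooUser/summary">>, $/, 2) -> <<"/FooUser/">>
%%   var_prefix(<<"/foo/bar/baz">>, $/, 3)     -> <<"/foo/bar/">>
var_prefix(Key, Sep, N) when is_binary(Key) ->
    var_prefix(Key, Sep, N, <<>>).

var_prefix(<<Sep, _Rest/binary>>, Sep, 1, Acc) ->
    <<Acc/binary, Sep>>;
var_prefix(<<Sep, Rest/binary>>, Sep, N, Acc) ->
    var_prefix(Rest, Sep, N - 1, <<Acc/binary, Sep>>);
var_prefix(<<C, Rest/binary>>, Sep, N, Acc) ->
    var_prefix(Rest, Sep, N, <<Acc/binary, C>>);
var_prefix(<<>>, _Sep, _N, Acc) ->
    Acc.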

Note

The HTTP interface for creating tables does not expose the fixed-length key prefix scheme. The Erlang API must be used in this case.

  • VarPrefixSeparator
    • Integer. Define the character used for variable-length key prefix calculation. Note that the default value of ASCII 47 (the “/” character), or any other character, does not imply any UNIX/POSIX style file or directory semantics.
  • VarPrefixNumSeparators
    • Integer. Define the number of VarPrefixSeparator bytes, and all bytes in between, used for consistent hashing. If VarPrefixSeparator=47 and VarPrefixNumSeparators=3, then for a key such as /foo/bar/baz, the prefix used for consistent hashing will be /foo/bar/.
  • Bricks
    • Integer. If Local=true (see above), then this integer defines the total number of logical bricks that will be created on the local node. This value is ignored if Local=false.
  • BPC
    • Integer. Define the number of bricks per chain.

The algorithm used for creating chain -> brick mapping is based on a “striping” principle: enough chains are laid across bricks in a stripe-wise manner so that all nodes (aka physical bricks) will have the same number of logical bricks in head, middle, and tail roles. See the example in the Hibari System Administrator’s Guide of link:hibari-sysadmin-guide.en.html#3-chains-striped-across-3-bricks[3 chains striped across three nodes].

The Erlang API must be used to create tables with other chain layout patterns.

  • NodeList
    • Comma-separated string. If Local=false, specify the list of nodes that will run logical bricks for the new table. Each node in the comma-separated list should take the form NodeName@HostName. For example, use hibari1@machine-a, hibari1@machine-b, hibari1@machine-c to specify three nodes.
  • NumNodesPerBlock
    • Integer. If Local=false, then this integer will affect the striping behavior of the default chain striping algorithm. This value must be zero (i.e. this parameter is ignored) or a multiple of the BPC parameter.

For example, if NodeList contains nodes A, B, C, D, E, and F, then the following striping patterns would be used:

  • NumNodesPerBlock=0 would stripe across all 6 nodes for 6 chains total.
  • NumNodesPerBlock=2 and BPC=2 would stripe 2 chains across nodes A & B, 2 chains across C & D, and 2 chains across E & F.
  • NumNodesPerBlock=3 and BPC=3 would stripe 3 chains across nodes A & B & C and 3 chains across D & E & F.
  • BlockMultFactor
    • Integer. If Local=false, then this integer will affect the striping behavior of the default chain striping algorithm. This value must be zero (i.e. this parameter is ignored) or greater than zero.

For example, if NodeList contains nodes A, B, C, D, E, and F, then the following striping patterns would be used:

  • NumNodesPerBlock=0 and BlockMultFactor=0 would stripe across all 6 nodes for 6 chains total.
  • NumNodesPerBlock=2 and BlockMultFactor=5 and BPC=2 would stripe 2*5=10 chains across nodes A & B, 2*5=10 chains across C & D, and 2*5=10 chains across E & F, for a total of 30 chains.
  • NumNodesPerBlock=3 and BlockMultFactor=4 and BPC=3 would stripe 3*4=12 chains across nodes A & B & C and 3*4=12 chains across D & E & F, for a total of 24 chains.

The Hibari Data Model

If a Hibari table were represented within an SQL database, it would look something like this:

[[sql-definition-hibari]]

include::texts-src/hibari-sql-definition.txt[]

Hibari table names use the Erlang data type “atom”. The types of all key-related attributes are presented below.

include::texts-src/hibari-key-value-attrs.txt[]

include::texts-src/hibari-key-value-attrs-expl.txt[]

The practical constraints on maximum value blob size are affected by total blob size and by how frequently large blobs are accessed. For example, storing an occasional 64MB value blob is quite different from a workload that is 100% writes of 64MB value blobs. The Hibari client API has no method to update or fetch less than the entire value blob, so a brick can be blocked for many seconds if it tries to operate on, for example, even a single 4GB blob. In addition, other processes can be blocked by ‘busy_dist_port’ events while big value blobs are being processed.

Hibari Client API Overview

As a key-value database, Hibari provides a simple client API with primitive operations for inserting, retrieving, and deleting data. Within certain restrictions, the API also supports compound operations that optionally can be executed as atomic transactions.

Supported Operations

Hibari’s client API supports the operations listed below.

Data Insertion
brick_simple:add(Table, Key, Value[, ExpTime][, Flags][, Timeout])

Adds a key-value pair that does not yet exist, along with optional flags.

Successful adding of a new key-value pair:

> brick_simple:add(tab1, <<"foo">>, <<"Hello, world!">>).
{ok,1271542959131192}

Failed attempt to add a key that already exists:

> brick_simple:add(tab1, <<"foo">>, <<"Goodbye, world!">>).
{key_exists,1271542959131192}
brick_simple:replace(Table, Key, Value[, ExpTime][, Flags][, Timeout])

Assigns a new value and/or new flags to a key that already exists.
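
Successful replacement of the value for an existing key (continuing the add example above; the timestamp shown is an illustrative placeholder):

> brick_simple:replace(tab1, <<"foo">>, <<"Goodbye, world!">>).
{ok,1271542959131193}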

brick_simple:set(Table, Key, Value[, ExpTime][, Flags][, Timeout])

Sets a key-value pair and optional flags regardless of whether the key yet exists.
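
Setting a key-value pair regardless of whether the key already exists (again with an illustrative placeholder timestamp):

> brick_simple:set(tab1, <<"bar">>, <<"Hello, world!">>).
{ok,1271542959131194}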

brick_simple:rename(Table, Key, NewKey[, ExpTime][, Flags][, Timeout])

Renames a key that already exists.

Successful renaming of a key-value pair:

> brick_simple:rename(tab1, <<"my/foo">>, <<"my/bar">>).
{ok,1271543165272987}

The rename operation fails if Key and NewKey do not share a common key prefix:

> brick_simple:rename(tab1, <<"my/foo">>, <<"her/foo">>).
...

See link:#creating-tables[Creating New Tables] (the VarPrefix option) for more details.

Data Retrieval
  • Retrieve a key and optionally its associated value and flags (see the example below):
    • link:#brick-simple-get[brick_simple:get/4]
  • Retrieve multiple lexicographically contiguous keys and optionally their associated values and flags:
    • link:#brick-simple-get-many[brick_simple:get_many/5]
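
For example, a single-key read using the short form of brick_simple:get/4 (a sketch; on success the reply includes the key's timestamp and value):

> brick_simple:get(tab1, <<"foo">>).
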
Data Deletion
  • Delete a key-value pair and associated flags:
    • link:#brick-simple-delete[brick_simple:delete/4]
Compound Operations
  • Execute a specified list of operations, optionally as an atomic transaction (micro-transaction), as sketched below:
    • link:#brick-simple-do[brick_simple:do/4]
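
For example, two keys that reside on the same chain can be updated atomically. The sketch below assumes the op-constructor helpers brick_server:make_txn/0 and brick_server:make_set/2 (consult the brick_simple:do/4 reference for the authoritative op-list API) and a table whose VarPrefix settings place both keys on one chain; if any operation in the list fails, none of them take effect:

> brick_simple:do(tab1, [brick_server:make_txn(),
                         brick_server:make_set(<<"/FooUser/summary">>, <<"new summary">>),
                         brick_server:make_set(<<"/FooUser/thing1">>, <<"new thing">>)]).
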
Fold Operations
  • Implement a fold operation across all keys in a table:
    • link:#brick-simple-fold-table[brick_simple:fold_table/7]
  • Implement a fold operation across all keys having a specified prefix:
    • link:#brick-simple-fold-key[brick_simple:fold_key_prefix/9]

Note

Fold operations are performed on the client side, not the server side.

Check and Swap (CAS)

If desired, clients can apply a “check and swap” (or “test and set”) logic to data insertion, retrieval, and deletion operations so that the operation will be executed only if the target key has the exact timestamp specified in the request.
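
A minimal sketch of this pattern with the native Erlang API, using the {'testset', Timestamp} operation flag documented for brick_simple:replace/6 later in this guide:

> {ok, TS} = brick_simple:set(tab1, <<"foo">>, <<"version 1">>).
> brick_simple:replace(tab1, <<"foo">>, <<"version 2">>, [{'testset', TS}]).
> brick_simple:replace(tab1, <<"foo">>, <<"version 3">>, [{'testset', TS}]).

The first replace succeeds because the key's stored timestamp still equals TS; the second fails because the successful replace advanced the timestamp, so the client would re-read the key and retry.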

Micro-Transaction

TODO

Client API: Native Erlang
Data Insertion
  • Add a key-value pair that does not yet exist, along with optional flags:

    • link:#brick-simple-add[brick_simple:add/6]
  • Assign a new value and/or new flags to a key that already exists:

    • link:#brick-simple-replace[brick_simple:replace/6]
  • Rename a key that already exists:

    • link:#brick-simple-rename[brick_simple:rename/6]
  • Set a key-value pair and optional flags regardless of whether the key yet exists:

    • link:#brick-simple-set[brick_simple:set/6]
Data Retrieval
  • Retrieve a key and optionally its associated value and flags:
    • link:#brick-simple-get[brick_simple:get/4]
  • Retrieve multiple lexicographically contiguous keys and optionally their associated values and flags:
    • link:#brick-simple-get-many[brick_simple:get_many/5]
Data Deletion
  • Delete a key-value pair and associated flags:
    • link:#brick-simple-delete[brick_simple:delete/4]
Compound Operations
  • Execute a specified list of operations, optionally as an atomic transaction (micro-transaction):
    • link:#brick-simple-do[brick_simple:do/4]

If desired, clients can apply a “test-and-set” logic to data insertion, retrieval, and deletion operations so that the operation will be executed only if the target key has the exact timestamp specified in the request.

Fold Operations
  • Implement a fold operation across all keys in a table:
    • link:#brick-simple-fold-table[brick_simple:fold_table/7]
  • Implement a fold operation across all keys having a specified prefix:
    • link:#brick-simple-fold-key[brick_simple:fold_key_prefix/9]

Note

Fold operations are performed on the client side, not the server side.

brick_simple:add/6

Adds the Key/Value pair (and optional Flags) to the table Table if the key does not already exist. The operation will fail if Key already exists.

brick_simple:add(Table, Key, Value)
brick_simple:add(Table, Key, Value, Flags)
brick_simple:add(Table, Key, Value, Timeout)
brick_simple:add(Table, Key, Value, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table to which to add the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to add to the table, in association with a paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of Value.

Parameters:
  • Value (val()) –

    Value to associate with the key

    • -type val() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the add operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: 'value_in_ram'
      • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested.

    • -type property() :: atom() | {term(), term()}
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:{'key_exists', timestamp()}
  • The operation failed because the key already exists.
  • -type timestamp() :: integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful adding of a new key-value pair:

> brick_simple:add(tab1, <<"foo">>, <<"Hello, world!">>).
{ok,1271542959131192}

Failed attempt to add a key that already exists:

> brick_simple:add(tab1, <<"foo">>, <<"Goodbye, world!">>).
{key_exists,1271542959131192}

Successful adding of a new key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:add(tab1, "foo1", "this is value1", ['value_in_ram']).
{ok,1271542959131192}

Successful adding of a new key-value pair, using a non-default operation timeout:

> brick_simple:add(tab1, "foo2", "this is value2", 20000).
{ok,1271542959131192}
brick_simple:replace/6

Replaces the Key and Value pair (and optional Flags) in the table Table if the key already exists. The operation will fail if Key does not already exist.

brick_simple:replace(Table, Key, Value)
brick_simple:replace(Table, Key, Value, Flags)
brick_simple:replace(Table, Key, Value, Timeout)
brick_simple:replace(Table, Key, Value, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to replace the key-value pair.

    • -type table() :: atom()
  • Key

    Key to replace in the table, in association with a new paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of Value.

Parameters:
  • Value (val()) –

    Value to associate with the key

    • -type val() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the replace operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'value_in_ram' | {'exp_time_directive', 'keep' | 'replace'} | {'attrib_directive', 'keep' | 'replace'}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp()
      • {'exp_time_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the ExpTime is kept from the old key value pair or replaced with the ExpTime provided in the replace operation
      • {'attrib_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the custom properties are kept from the old key value pair or replaced with the custom properties provided in the replace operation
        • If kept, the custom properties remain unchanged. If you specify custom properties explicitly in the replace operation, Hibari adds them to the resulting key value pair
        • If replaced, all original custom properties are deleted, and then Hibari adds the custom properties in the replace operation to the resulting key value pair
      • 'value_in_ram'
        • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested

  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:'key_not_exist'
  • The operation failed because the key does not exist
  • -type timestamp() :: integer()
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • -type timestamp() :: integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful replacement of a key-value pair:

> brick_simple:replace(tab1, <<"foo">>, <<"Goodbye, world!">>).
{ok,1271543165272987}

Failed attempt to replace a key that does not yet exist:

> brick_simple:replace(tab1, <<"key3">>, <<"new and improved value">>).
key_not_exist

Successful replacement of a key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:replace(tab1, "foo", "You again, world!", ['value_in_ram']).
{ok,1271543165272987}

Failed attempt to replace a key for which we have incorrectly specified its current timestamp:

> brick_simple:replace(tab1, "foo", "Whole new value", [{'testset', 12345}]).
{ts_error,1271543165272987}

Successful replacement of a key-value pair for which we have correctly specified its current timestamp:

> brick_simple:replace(tab1, "foo", "Whole new value", [{'testset', 1271543165272987}]).
{ok,1271543165272988}

Successful replacement of a key-value pair, using a non-default operation timeout:

> brick_simple:replace(tab1, "foo", "Foo again?", 30000).
{ok,1271543165272989}
brick_simple:set/6

Sets the Key and Value pair (and optional Flags) in the table Table, regardless of whether or not the key already exists.

brick_simple:set(Table, Key, Value)
brick_simple:set(Table, Key, Value, Flags)
brick_simple:set(Table, Key, Value, Timeout)
brick_simple:set(Table, Key, Value, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table to which to set the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to set in the table, in association with a paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of Value.

Parameters:
  • Value (val()) –

    Value to associate with the key

    • -type val() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the set operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'value_in_ram' | {'exp_time_directive', 'keep' | 'replace'} | {'attrib_directive', 'keep' | 'replace'}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp(). Using this flag with set will result in an error if the key does not already exist or if the key exists but has a non-matching timestamp.
      • {'exp_time_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the ExpTime is kept from the old key value pair or replaced with the ExpTime provided in the set operation
      • {'attrib_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the custom properties are kept from the old key value pair or replaced with the custom properties provided in the set operation
        • If kept, the custom properties remain unchanged. If you specify custom properties explicitly in the set operation, Hibari adds them to the resulting key value pair
        • If replaced, all original custom properties are deleted, and then Hibari adds the custom properties in the set operation to the resulting key value pair
      • 'value_in_ram'
        • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested

  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:'key_not_exist'
  • The operation failed because the {'testset', timestamp()} flag was used and the key does not exist
  • -type timestamp() :: integer()
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • -type timestamp() :: integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful setting of a key-value pair:

> brick_simple:set(tab1, <<"key4">>, <<"cool value">>).
{ok,1271542959131192}

Successful setting of a key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:set(tab1, "goo", "value6", ['value_in_ram']).
{ok,1271542959131193}

Failed attempt to set a key-value pair, when we have used the testset flag but the key does not yet exist:

> brick_simple:set(tab1, "boo", "hoo", [{'testset', 1271543165272987}]).
key_not_exist

Successful setting of a key-value pair, when we have used the testset flag and the key does already exist and its timestamp matches our specified timestamp:

> brick_simple:set(tab1, "goo", "value7", [{'testset', 1271543165272432}]).
{ok,1271543165272433}
brick_simple:rename/6

Renames Key to NewKey in the table Table, carrying over the Value and Flags, if the key already exists. The operation will fail if:

  • Key does not already exist
  • ... or Key and NewKey do not share a common key prefix. (See TODO (Creating New Table - VarPrefix) for more details)
brick_simple:rename(Table, Key, NewKey)
brick_simple:rename(Table, Key, NewKey, Flags)
brick_simple:rename(Table, Key, NewKey, Timeout)
brick_simple:rename(Table, Key, NewKey, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to rename the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to rename in the table, in association with a paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of NewKey

Parameters:
  • NewKey (key()) –

    New key to use in the table, in association with the existing paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the rename operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'value_in_ram' | {'exp_time_directive', 'keep' | 'replace'} | {'attrib_directive', 'keep' | 'replace'}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp().
      • {'exp_time_directive', 'keep' | 'replace'}
        • Defaults to 'keep'
        • Specifies whether the ExpTime is kept from the old key value pair or replaced with the ExpTime provided in the rename operation
      • {'attrib_directive', 'keep' | 'replace'}
        • Defaults to 'keep'
        • Specifies whether the custom properties are kept from the old key value pair or replaced with the custom properties provided in the rename operation
        • If kept, the custom properties remain unchanged. If you specify custom properties explicitly in the rename operation, Hibari adds them to the resulting key value pair
        • If replaced, all original custom properties are deleted, and then Hibari adds the custom properties in the rename operation to the resulting key value pair
      • 'value_in_ram'
        • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested

  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:'key_not_exist'
  • The operation failed because the key does not exist, or because Key and NewKey are equal
  • -type timestamp() :: integer()
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • -type timestamp() :: integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key and the new key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful renaming of a key-value pair:

> brick_simple:rename(tab1, <<"foo">>, <<"bar">>).
{ok,1271543165272987}

Failed attempt to rename a key that does not yet exist:

> brick_simple:rename(tab1, <<"key3">>, <<"bar">>).
key_not_exist

Successful renaming of a key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:rename(tab1, "foo", "bar", ['value_in_ram']).
{ok,1271543165272987}

Failed attempt to rename a key for which we have incorrectly specified its current timestamp:

> brick_simple:rename(tab1, "foo", "bar", [{'testset', 12345}]).
{ts_error,1271543165272987}

Successful renaming of a key-value pair for which we have correctly specified its current timestamp:

> brick_simple:rename(tab1, "foo", "bar", [{'testset', 1271543165272987}]).
{ok,1271543165272988}

Successful renaming of a key-value pair, using a non-default operation timeout:

> brick_simple:rename(tab1, "foo", "bar", 30000).
{ok,1271543165272989}
brick_simple:get/4

From table Table, retrieve Key and specified attributes of the key (as determined by Flags).

brick_simple:get(Table, Key)
brick_simple:get(Table, Key, Flags)
brick_simple:get(Table, Key, Timeout)
brick_simple:get(Table, Key, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table from which to retrieve the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to retrieve from the table

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution

Parameters:
  • Flags (flags_list()) –
    • List of operational flags to apply to the get operation.
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag()]
    • -type do_op_flag() :: 'get_all_attribs' | 'witness' | {'testset', timestamp()} | 'must_exist' | 'must_not_exist'
    • -type timestamp() :: integer()
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of the key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blob in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'testset', timestamp()}
        • Fail the operation if the key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp().
        • This flag has priority over the 'must_exist' and 'must_not_exist' flags
      • 'must_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key does not exist
      • 'must_not_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key exists. This flag may be useful when the relationship between two or more keys is important to the client application
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success returns

Return type:{'ok', timestamp(), val()}
  • Success return when the get request uses neither the 'witness' flag nor the 'get_all_attribs' flag
  • -type timestamp() :: integer()
  • -type val() :: iodata()
  • -type iodata() :: iolist() | binary()
  • -type iolist()  :: [char() | binary() | iolist()]
Return type:{'ok', timestamp()}
  • Success return when the get uses 'witness' but not 'get_all_attribs'
Return type:{'ok', timestamp(), exp_time(), proplist()}
  • Success return when the get uses both 'witness' and 'get_all_attribs'
  • -type exp_time() :: time_t()
  • -type proplist() :: [property()]
  • -type property() :: atom() | {term(), term()}
Return type:{'ok', timestamp(), val(), exp_time(), proplist()}
  • Success return when the get uses 'get_all_attribs' but not 'witness'
  • -type exp_time() :: time_t()

Note

When a proplist() is returned, one of the properties in the list will always be {val_len, Size::integer()}, where Size is the size of the value blob in bytes

Error returns

Return type:'key_not_exist'
  • The operation failed because the key does not exist.
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful retrieval of a key-value pair:

> brick_simple:get(tab1, "goo").
{ok,1271543165272432,<<"value7">>}

Successful retrieval of a key without its associated value blob:

> brick_simple:get(tab1, "goo", ['witness']).
{ok,1271543165272432}

Failed attempt to retrieve a key that does not exist:

> brick_simple:get(tab1, "moo").
key_not_exist
brick_simple:get_many/5

Get many keys from a single chain in the table Table, up to a maximum of MaxNum keys. Keys are returned in lexicographic sorting order starting with the first key _after_ the key specified by the Key argument. The return list includes a boolean value indicating whether or not there are more keys after the last key of the return results.

Important

A single get_many() function call cannot be used to retrieve keys from across multiple storage chains. The consistent hash of Key will send the get_many operation to the tail brick in a single chain; all keys returned will come from that single brick only.

brick_simple:get_many(Table, Key, MaxNum)
brick_simple:get_many(Table, Key, MaxNum, Flags)
brick_simple:get_many(Table, Key, MaxNum, Timeout)
brick_simple:get_many(Table, Key, MaxNum, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table from which to retrieve the key-value pairs

    • -type table() :: atom()
  • Key (key()) –

    Key after which to start the get_many retrieval, proceeding in lexicographic order with the first key after the specified Key

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution

Parameters:
  • MaxNum (integer()) – Maximum number of keys to return
  • Flags (flags_list()) –
    • List of operational flags to apply to the get_many operation.
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag()]
    • -type do_op_flag() :: 'get_all_attribs' | 'witness' | {'binary_prefix', binary()} | {'max_bytes', integer()} | {'max_num', integer()}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of each key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blobs in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'binary_prefix', binary()}
        • Return only keys that have a binary prefix that is exactly equal to binary()
      • {'max_bytes', integer()}
        • Return only as many keys such that the sum of the sizes of their corresponding value blobs does not exceed integer() bytes
      • {'max_num', integer()}
        • Maximum number of keys to return
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success returns

Return type:

{ok, {[{key(), timestamp(), val()}], boolean()}}

  • Success return when the get_many request uses neither the 'witness' flag nor the 'get_all_attribs' flag
  • -type timestamp() :: integer()
  • -type val() :: iodata()
  • -type iodata() :: iolist() | binary()
  • -type iolist() :: [char() | binary() | iolist()]

Return type:

{ok, {[{key(), timestamp()}], boolean()}}

  • Success return when the get_many uses 'witness' but not 'get_all_attribs'

Return type:

{ok, {[{key(), timestamp(), exp_time(), proplist()}], boolean()}}

  • Success return when the get_many uses both 'witness' and 'get_all_attribs'
  • -type exp_time() :: time_t()
  • -type proplist() :: [property()]
  • -type property() :: atom() | {term(), term()}

Return type:

{ok, {[{key(), timestamp(), val(), exp_time(), proplist()}], boolean()}}

  • Success return when the get_many uses 'get_all_attribs' but not 'witness'
  • -type exp_time() :: time_t()

Note

The boolean at the end of the success return indicates whether or not the chain has more keys lexicographically after the last key in the return (true for yes, false for no). When a proplist() is returned, one of the properties in the list will always be {val_len, Size::integer()}, where Size is the size of the value blob in bytes.

Error returns

Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful retrieval of all keys from a table that currently has only two keys. The boolean false indicates that there are no keys following the foo key:

> brick_simple:get_many(tab1, "", 5).
{ok,{[{<<"another">>,1271543102911775,<<"yes!">>},
      {<<"foo">>,1271543165272987,<<"Foo again?">>}],
     false}}

Successful retrieval of all keys from a table that currently has only two keys, using the witness flag in the request:

> brick_simple:get_many(tab1, "", 5, ['witness']).
{ok,{[{<<"another">>,1271543102911775},
      {<<"foo">>,1271543165272987}],
     false}}

Successful retrieval of all keys from a table that currently has only two keys, using the get_all_attribs flag in the request:

> brick_simple:get_many(tab1, "", 5, ['get_all_attribs']).
{ok,{[{<<"another">>,1271543102911775,<<"yes!">>,0,[{val_len,4}]},
      {<<"foo">>,1271543165272987,<<"Foo again?">>,0,[{val_len,6}]}],
     false}}
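
Because the trailing boolean indicates whether more keys remain on the chain, a client can page through all of a chain’s keys by feeding the last key it received back in as the next starting key. The helper below is only an illustrative sketch (fetch_chain_keys/3 is not part of the brick_simple API); it uses the 'witness' flag, so each returned tuple matches {Key, TS}:

fetch_chain_keys(Tab, StartKey, Acc) ->
    case brick_simple:get_many(Tab, StartKey, 100, ['witness']) of
        {ok, {Tuples, true}} ->
            %% More keys remain: continue after the last key received
            {LastKey, _TS} = lists:last(Tuples),
            fetch_chain_keys(Tab, LastKey, Acc ++ Tuples);
        {ok, {Tuples, false}} ->
            %% No more keys on this chain
            Acc ++ Tuples
    end.

Remember that this pages through a single chain only; to visit every key in a table, use brick_simple:fold_table/7 as described below.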
brick_simple:delete/4

Deletes key Key from the table Table. The operation will fail if Key does not already exist.

brick_simple:delete(Table, Key)
brick_simple:delete(Table, Key, Flags)
brick_simple:delete(Table, Key, Timeout)
brick_simple:delete(Table, Key, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table from which to delete the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to delete from the table

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution

Parameters:
  • Flags (flags_list()) –
    • List of operational flags to apply to the delete operation.
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'must_exist' | 'must_not_exist'
    • -type timestamp() :: integer()
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp(). This flag has priority over the 'must_exist' and 'must_not_exist' flags
      • 'must_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key does not exist
      • 'must_not_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key exists. This flag may be useful when the relationship between two or more keys is important to the client application
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:'ok'

Error returns

Return type:'key_not_exist'
  • The operation failed because the key does not exist
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • -type timestamp() :: integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful deletion of a key and its associated value and attributes:

> brick_simple:delete(tab1, <<"foo">>).
ok

Failed attempt to delete a key that does not exist:

> brick_simple:delete(tab1, "key6").
key_not_exist

Failed attempt to delete a key for which we have incorrectly specified its current timestamp:

> brick_simple:delete(tab1, "goo", [{'testset', 12345}]).
{ts_error,1271543165272987}

Successful deletion of a key for which we have correctly specified its current timestamp:

> brick_simple:delete(tab1, "goo", [{'testset', 1271543165272987}]).
ok

Successful deletion of a key, using a non-default operation timeout:

> brick_simple:delete(tab1, "key3", 30000).
ok
brick_simple:do/4

Sends a list of primitive operations to the table Table. They will be executed at the same time by a Hibari brick. If the first item in the OpList is brick_server:make_txn(), then the list of operations is executed in the context of a micro-transaction: either all operations will be executed successfully or none will be executed.

We term these “micro”-transactions because they are subject to certain limitations that apply to all operations that use the brick_simple:do() API:

  • All impacted keys must be in the same table.
  • All impacted keys must be in the same chain.
  • All operations in the transaction must be sent in a single brick_simple:do() call. Unlike some other databases, it is not possible to request a transaction handle and to add operations to that transaction in a one-by-one, “ad hoc” manner.

For further information about micro-transactions, see link:hibari-sysadmin-guide.en.html#micro-transactions[Hibari System Administrator’s Guide, “Micro-Transactions” section].

brick_simple:do(Table, OpList)
brick_simple:do(Table, OpList, Timeout)
brick_simple:do(Table, OpList, OpFlags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to perform the operations

    • -type table() :: atom()
  • OpList (do_op_list()) –
    • List of primitive operations to perform. Each primitive is invoked using the brick_server:make_*() API
    • -type do_op_list() :: [do1_op()]
    • -type do1_op() ::
      • brick_server:make_add(Key, Value, ExpTime, Flags)
      • brick_server:make_replace(Key, Value, ExpTime, Flags)
      • brick_server:make_set(Key, Value, ExpTime, Flags)
      • brick_server:make_rename(Key, NewKey, ExpTime, Flags)
      • brick_server:make_get(Key, Flags)
      • brick_server:make_get_many(Key, Flags)
      • brick_server:make_delete(Key, Flags)
      • brick_server:make_txn()
        • Include brick_server:make_txn() as the first item in your OpList if you want the do operation to be executed as an atomic transaction
        • Note that the arguments for each primitive are the same as those for the primitives when they are executed on their own, with the exclusion of the Tab and Timeout arguments, both of which serve as arguments to the overall do operation rather than as arguments to the primitives. For example, an add on its own is brick_simple:add(Tab, Key, Value, ExpTime, Flags, Timeout), whereas in the context of a do operation an add primitive is brick_server:make_add(Key, Value, ExpTime, Flags)
        • For further information about each primitive, see link:#brick-simple-add[brick_simple:add/6], link:#brick-simple-replace[brick_simple:replace/6], link:#brick-simple-set[brick_simple:set/6], link:#brick-simple-rename[brick_simple:rename/6], link:#brick-simple-get[brick_simple:get/4], link:#brick-simple-get-many[brick_simple:get_many/5], and link:#brick-simple-delete[brick_simple:delete/4]
  • OpFlags (do_flags_list()) –
    • List of operational flags to apply to the overall do operation.
    • Optional; defaults to empty list
    • -type do_flags_list() :: [do_flag()]
    • -type do_flag() :: 'fail_if_wrong_role' | 'ignore_role'
    • Operational flag usage
      • 'fail_if_wrong_role'
        • If the ‘do’ operation is sent to the wrong brick in the target chain (e.g. a ‘read’ request mistakenly sent to the ‘head’ brick or a ‘write’ request mistakenly sent to the ‘tail’ brick), fail the transaction immediately. If this flag is not used, the default behavior is for the incorrect brick to forward the request to the correct brick
      • 'ignore_role'
        • If this flag is used, then whichever brick receives the request will reply to the request directly, regardless of the brick’s assigned role
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:[do1_res_ok]
  • List of do1_res_ok, one for each primitive operation specified in the do request. Return list order corresponds to the order in which primitive operations are listed in the request’s OpList. Note that if the do request does not use transaction semantics, then some individual primitive operations may fail without the overall do operation failing
  • Within the return list, possible do1_res_ok returns to each individual primitive operation are the same as the possible returns that the primitive operation type could generate if it were executed on its own. For example, within the do operation’s success return list, the possible returns for a primitive add operation are the same as the returns described in the link:#brick-simple-add[brick_simple:add/6] section; potential returns to a primitive replace operation are the same as those described in the link:#brick-simple-replace[brick_simple:replace/6] section; and likewise for link:#brick-simple-set[set], likewise for link:#brick-simple-rename[rename], link:#brick-simple-get[get], link:#brick-simple-get-many[get_many], and link:#brick-simple-delete[delete].

Error returns

Return type:{txn_fail, [{integer(), do1_res_fail()}]}
  • Operation failed because transaction semantics were used in the do request and one or more primitive operations within the transaction failed. The integer() identifies the failed primitive operation by its position within the request’s OpList. For example, a 2 indicates that the second primitive listed in the request’s OpList failed. Note that this position identifier does not count the txn() specifier at the start of the OpList.
  • do1_res_fail() indicates the type of failure for the failed primitive operation. Possibilities are:
    • {'key_exists', timestamp()}
      • -type timestamp() :: integer()
    • 'key_not_exist'
    • {'ts_error', timestamp()}
    • 'invalid_flag_present'
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_flag() was found in the do request’s OpFlags argument. Note this is a different error than an invalid flag being found within an individual primitive
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain
  • -type node() :: atom()
Examples

Successful do operation adding two new keys to table tab1, without transaction semantics:

> brick_simple:do(tab1, [brick_server:make_add("foo3", "bar3"),
                         brick_server:make_add("foo4", "bar4")]).
[ok,ok]

Successful creation of two get primitives, Do1 and Do2, and their subsequent combination into a do request, without transaction semantics:

> Do1 = brick_server:make_get("foo").
{get,<<"foo">>,[]}
> Do2 = brick_server:make_get("foo2").
{get,<<"foo2">>,[]}
> brick_simple:do(tab1, [Do1, Do2]).
[{ok,1271543102911775,<<"Foo again?">>},key_not_exist]

Failed operation with transaction semantics. Because transaction semantics are used, the failure of the primitive Do2b causes the entire operation to fail:

> Do1b = brick_server:make_get("foo").
{get,<<"foo">>,[]}
> Do2b = brick_server:make_get("foo2", [must_exist]).
{get,<<"foo2">>,[must_exist]}
> brick_simple:do(tab1, [brick_server:make_txn(), Do1b, Do2b]).
{txn_fail,[{2,key_not_exist}]}
brick_simple:fold_table/7

Attempts a fold operation across all keys in a table. For general information about the Erlang fold function that underlies this operation, see http://www.erlang.org/doc/man/lists.html#foldl-3.

Important

Do not execute this operation while a data migration is being performed

brick_simple:fold_table(Table, Fun, Acc, NumItems, Flags)
brick_simple:fold_table(Table, Fun, Acc, NumItems, Flags, MaxParallel)
brick_simple:fold_table(Table, Fun, Acc, NumItems, Flags, MaxParallel, Timeout)
Parameters:
  • Table (table()) –

    Name of the table across which to perform the fold operation

    • -type table() :: atom()
  • Fun (fun_arity_2()) –

    Function to apply to successive elements of the list

    • -type fun_arity_2() :: fun(({ChainName, TupleFromGetMany}, Acc) -> Acc)
      • TupleFromGetMany is a single result tuple from a link:#brick-simple-get-many[brick_simple:get_many()] result. Its format can vary according to the Flags argument, which is passed as-is to a get_many() call. For example, if Flags = [], then TupleFromGetMany will match {Key, TS, Value}. If Flags = [witness], then TupleFromGetMany will match {Key, TS}
    • Acc
      • The accumulator term
  • Acc (term()) – Initial value of the accumulator term
  • NumItems (integer()) – Batch size for the get_many operations used by the fold function
  • Flags (flags_list()) –
    • List of operational flags to apply to the fold_table operation. The supported flags are the same as those for link:#brick-simple-get-many[brick_simple:get_many()]
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: 'get_all_attribs' | 'witness' | {'binary_prefix', binary()} | {'max_bytes', integer()}
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of each key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blobs in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'binary_prefix', binary()}
        • Return only keys that have a binary prefix that is exactly equal to binary()
      • {'max_bytes', integer()}
        • Return only as many keys such that the sum of the sizes of their corresponding value blobs does not exceed integer() bytes
  • MaxParallel (integer()) –
    • If MaxParallel = 0, a true fold will be performed. If MaxParallel >= 1, then an independent fold will be performed on each chain, with up to MaxParallel folds running in parallel. The result from each chain fold will be returned to the caller as-is, i.e. the results will not be combined as in the “reduce” phase of a map-reduce cycle
    • Optional; defaults to 0
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{ok, Acc::term(), Iterations::integer()}

Error return

Return type:{error, Error::term(), Acc::term(), Iterations::integer()}
Examples

to be added
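
Until official examples are added, the following unofficial sketch illustrates the calling convention: it counts every key in a hypothetical table tab1, fetching 100 keys per underlying get_many batch and passing the 'witness' flag so that no value blobs are transferred (each tuple therefore matches {Key, TS}):

> CountFun = fun({_ChainName, {_Key, _TS}}, Acc) -> Acc + 1 end.
> brick_simple:fold_table(tab1, CountFun, 0, 100, ['witness']).

A successful call returns {ok, FinalCount, Iterations}, as described in the success return type above.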

brick_simple:fold_key_prefix/9

For a binary key prefix Prefix, folds over all keys in table Table starting with StartKey, sleeping for SleepTime milliseconds between iterations and using Flags and NumItems as arguments to link:#brick-simple-get-many[brick_simple:get_many()]. For general information about the Erlang fold function that underlies this operation, see http://www.erlang.org/doc/man/lists.html#foldl-3.

Important

Do not execute this operation while a data migration is being performed

brick_simple:fold_key_prefix(Table, Prefix, Fun, Acc, Flags)
brick_simple:fold_key_prefix(Table, Prefix, StartKey, Fun, Acc, Flags, NumItems, SleepTime, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to perform the fold operation

    • -type table() :: atom()
  • Prefix (binary()) – Key prefix for which to perform the fold operation
  • StartKey (binary()) –
    • Key at which to initiate the fold operation
    • Optional; defaults to your specified Prefix
  • Fun (fun_arity_2()) –

    Function to apply to successive elements of the list

    • -type fun_arity_2() :: fun(({ChainName, TupleFromGetMany}, Acc) -> Acc)
      • TupleFromGetMany is a single result tuple from a link:#brick-simple-get-many[brick_simple:get_many()] result. Its format can vary according to the Flags argument, which is passed as-is to a get_many() call. For example, if Flags = [], then TupleFromGetMany will match {Key, TS, Value}. If Flags = [witness], then TupleFromGetMany will match {Key, TS}
    • Acc
      • The accumulator term
  • Acc (term()) – Initial value of the accumulator term
  • Flags (flags_list()) –
    • List of operational flags to apply to the fold_key_prefix operation. The supported flags are the same as those for link:#brick-simple-get-many[brick_simple:get_many()], excluding the {'binary_prefix', binary()} flag. This flag is inappropriate since the key prefix is passed directly through the Prefix argument of brick_simple:fold_key_prefix()
    • -type flags_list() :: ['get_all_attribs' | 'witness' | {'max_bytes', integer()}]
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of each key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blobs in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'max_bytes', integer()}
        • Return only as many keys such that the sum of the sizes of their corresponding value blobs does not exceed integer() bytes
  • NumItems (integer()) – Batch size for the get_many operations used by the fold function
  • SleepTime (integer()) –
    • Sleep time between iterations, in milliseconds
    • Optional; defaults to 0
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{ok, Acc::term(), Iterations::integer()}

Error return

Return type:{error, Error::term(), Acc::term(), Iterations::integer()}
Examples

to be added
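
Until official examples are added, the following unofficial sketch illustrates the calling convention: it collects every key that begins with the hypothetical prefix <<"foo">> in table tab1, again using the 'witness' flag so that only {Key, TS} tuples are returned:

> CollectFun = fun({_ChainName, {Key, _TS}}, Acc) -> [Key | Acc] end.
> brick_simple:fold_key_prefix(tab1, <<"foo">>, CollectFun, [], ['witness']).

A successful call returns {ok, Keys, Iterations}, as described in the success return type above.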

Client API: UBF

link:http://github.com/ubf/ubf[The UBF protocol] is a formally-specified family of protocols that are supported by a large number of client languages. This section attempts to describe the layers of the UBF protocol stack, how to use the UBF client in Erlang and other languages, and how to use that client to access a Hibari storage cluster.

The Hibari source distribution includes UBF/EBF protocol support for the following languages:

  • Erlang, see xref:using-ubf-erlang-client[]
  • Java, see xref:using-ubf-java-client[]
  • Python, see xref:using-ubf-python-client[]

[[hibari-server-impl-of-ubf-proto-stack]]

The Hibari Server’s Implementation of the UBF Protocol Stack
UBF(A): Bottom Layer, transport and session protocol layer

This layer plays the same basic role as many other serialized data transport protocols that use TCP for host-to-host transport, such as link:http://en.wikipedia.org/wiki/Open_Network_Computing_Remote_Procedure_Call[ONC-RPC], link:http://en.wikipedia.org/wiki/IIOP[CORBA IIOP], link:http://en.wikipedia.org/wiki/Protocol_buffers[Protocol Buffers], and link:http://en.wikipedia.org/wiki/Thrift_(protocol)[Thrift].

Hibari servers support several of these session protocols on top of a TCP/IP transport protocol. The choice of session protocol is a matter of convenience and/or support for the application developer. It should be as easy for an application developer to use Hibari with Ruby and JSON-RPC as it is with Python and Thrift or EBF.

  • UBF(A), Joe Armstrong’s original session layer protocol
  • EBF, the Erlang Binary Format. The session layer protocol is a thin, efficient protocol that uses the Erlang BIFs term_to_binary() and binary_to_term() to serialize Erlang data terms. This protocol is very closely related to the link:http://bert-rpc.org/[BERT protocol].
  • JSON over TCP, also called JSF (the JavaScript Format). Erlang terms are encoded as link:http://en.wikipedia.org/wiki/JSON[JSON terms] and transmitted directly over a TCP transport. This protocol is not in common use but is easy to implement in the UBF server framework.
  • HTTP, the link:http://en.wikipedia.org/wiki/HTTP[Hypertext Transfer Protocol]. This protocol is used to support Hibari’s link:http://en.wikipedia.org/wiki/JSON-RPC[JSON-RPC] server.
  • link:http://en.wikipedia.org/wiki/Thrift_(protocol)[Thrift]. Similar to EBF, except that Thrift’s binary encoding is used for the wire protocol instead of UBF(A) or Erlang’s native wire formats.
  • link:http://en.wikipedia.org/wiki/Protocol_buffers[Protocol Buffers]. Similar to EBF, except that Google’s Protocol Buffers binary encoding is used for the wire protocol instead of UBF(A) or Erlang’s native wire formats. Hibari support is experimental (i.e. not yet implemented).
  • link:http://hadoop.apache.org/avro/docs/current/[Avro]. Similar to EBF, except that Avro’s binary encoding is used for the wire protocol instead of UBF(A) or Erlang’s native wire formats. Hibari support is experimental (i.e. not yet implemented).
UBF(B): Middle Layer, the “contract”

UBF(B) is a programming language for describing types in UBF(A) and protocols between clients and servers. UBF(B) is roughly equivalent to Verified XML, XML-schemas, SOAP, and WSDL.

This layer enforces a protocol “contract”, a formal specification of all data sent by the client and by the server. Any data that does not precisely conform to the protocol is rejected by the contract checker (which is embedded in the server). If the client wishes, it may also use the contract checker to validate data sent by the server, though this is not commonly done.

UBF(C): Top Layer, the UBF Metaprotocol

The metaprotocol is used at the beginning of a UBF session to select one of the UBF(B) contracts that the TCP listener is capable of offering. At the moment, Hibari servers support only the “gdss” contract, but other contracts may be added in the future.

[[ubf-representation-of-strings]]

UBF representation of strings vs. binaries

The Erlang language does not have a data type specifically for strings. Instead, strings are typically represented as lists of integers (ASCII byte values) and/or binaries.

A UBF contract makes a distinction between a string, list, and binary. In the case of a string, UBF(A) encodes a string using the notation {'#S', "Hello, world!"} to represent the string “Hello, world!”.

This string encoding is cumbersome to use for developers; in Erlang, the ubf.hrl header file includes a macro ?S("Hello, world!") as a slightly less ugly shortcut. When using other languages, the 2-tuple and the atom '#S' would be created as any other 2-tuple and atom.

Fortunately, there is only one case where the string type is necessary: using the startSession metaprotocol command to start using the Hibari data server contract. An example will be shown below.
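
For instance, in Erlang source code that includes ubf.hrl, the contract selection shown in the walkthrough below could be written with the macro instead of the raw 2-tuple (here Pid stands for the connection handle returned by ubf_client:connect/4):

%% ?S("gdss") expands to {'#S', "gdss"}
ubf_client:rpc(Pid, {startSession, ?S("gdss"), []}).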

[[using-ubf-in-any-language]]

Steps for Using a UBF-based Protocol in Any Language

The steps to use a UBF-based protocol are the same in any language.

  1. Create a connection to the UBF server.
    • ... or the EBF server, or the JSON-RPC server, or the Thrift server, or the ....
  2. Use the UBF metaprotocol to start using the gdss contract, i.e. the Hibari server contract.
  3. Send one or more Hibari server queries and decode the respective server responses.
  4. Close the connection to the UBF server.

[[the-hibari-ubf-protocol-contract]]

The Hibari UBF Protocol Contract

The Hibari UBF Protocol contract can be found in the file ubf_gdss_plugin.con.

Note

See the Hibari source code for the most up-to-date version of this file. link:./misc-codes/ubf_gdss_plugin.con[This documentation has a copy of ubf_gdss_plugin.con], though it may be slightly out-of-date.

The names of the UBF types specified in the contract may differ slightly from the names of the types used in this document’s xref:client-api-erlang[]. For example, the UBF contract calls the key expiration time exp_time(), while the type system in this document calls it expiry(). However, in all cases of slightly different names, the fundamental data type that both names use is the same: e.g. integer() for expiration time.

For each command, the UBF contract uses the following naming conventions:

  • CommandName_req() for the request sent from client -> server, e.g. set_req() for the set command.
  • CommandName_res() for the response sent from server -> client, e.g. set_res() for the set response.

The general form of a UBF RPC call is a tuple. The first element in the tuple is the name of the command, and the following elements are arguments for that command. The response can be any Erlang term, but the Hibari contract will only return the atom or tuple types.

The following is a mapping of each UBF client request type to its Erlang API function, in alphabetical order:

  • add_req() -> brick_simple:add(), see xref:brick-simple-add[].
  • delete_req() -> brick_simple:delete(), see xref:brick-simple-delete[].
  • do_req() -> brick_simple:do(), see xref:brick-simple-do[].
  • get_req() -> brick_simple:get(), see xref:brick-simple-get[].
  • get_many_req() -> brick_simple:get_many(), see xref:brick-simple-get-many[].
  • rename_req() -> brick_simple:rename(), see xref:brick-simple-rename[].
  • replace_req() -> brick_simple:replace(), see xref:brick-simple-replace[].
  • set_req() -> brick_simple:set(), see xref:brick-simple-set[].

[[using-ubf-erlang-client]]

Using the UBF Client Library for Erlang

Important

  1. When using the Erlang shell for experimentation & prototyping, that shell must have the path to the Erlang UBF client library in its search path. The easiest way to do this is to use the arguments -pz /path/to/ubf/library/ebin to your Erlang shell’s erl command.
  2. When writing code, add the statement -include("ubf.hrl"). at the top of your source module to gain access to the ?S() macro. Due to limitations in the Erlang shell, macros cannot be used in the shell.

As outlined in xref:using-ubf-in-any-language[], the first step is to create a connection to a Hibari server. If the Hibari cluster has multiple nodes, then it doesn’t matter which one that you connect to: all nodes can handle any UBF request and will route the query to the proper brick.

  1. Create a connection to the UBF server (on “localhost” TCP port 7581):

    (asdf@bb3)54> {ok, P1, _} = ubf_client:connect("localhost", 7581, [{proto, ubf}], 5000).
    {ok,<0.139.0>,{'#S', "gdss_meta_server"}}
    

    The second step is to use the UBF metaprotocol to select the Hibari server contract, called “gdss”, for all further commands on this connection.

    Tip

    The Hibari server contract is “stateless”. All reply terms from the ubf_client:rpc/2 function use the form {reply,ServerReply,UBF_StateName}. Because the Hibari server contract is stateless, the UBF_StateName will always be the atom none.

  2. Use the UBF metaprotocol to request the “gdss” contract:

    (asdf@bb3)55> ubf_client:rpc(P1, {startSession, {'#S', "gdss"}, []}).
    {reply,{ok,ok},none}
    

    Now that the UBF connection is set up, we can use it to set a key “foo”.

  3. Set the key “foo” in table tab1 with the value “foo val”, no expiration time, no flags, and a timeout of 5 seconds:

    (asdf@bb3)59> ubf_client:rpc(P1, {set, tab1, <<"foo">>, <<"foo val">>, 0, [], 5000}).
    {reply,ok,none}
    

    Note

    Note that the set_req() (in the example above) and the get_req() (in the example below) return the same types as those described in xref:brick-simple-set[] and xref:brick-simple-get[], respectively.

    The only difference is that the ubf_client:rpc/2 function wraps the server’s reply in a 3-tuple: {reply,ServerReply,none}.

  4. Get the key “foo” in table tab1, timeout in 5 seconds:

    (asdf@bb3)66> ubf_client:rpc(P1, {get, tab1, <<"foo">>, [], 5000}).
    {reply,{ok,1273009092549799,<<"foo val">>},none}
    

    If the client sends a request that violates the contract, the server will tell you, as in this example.

  5. Send a contract-violating request:

    (asdf@bb3)89> ubf_client:rpc(P1, {bbb, 3000}).
    {reply,{clientBrokeContract,{bbb,3000},[]},none}
    

    When you are done with the connection, it is polite to close the connection explicitly. The server will quietly clean up its side of the connection if the client forgets to call or cannot call stop/1.

  6. Close the UBF connection:

    (asdf@bb3)92> ubf_client:stop(P1).
    ok
    

[[using-ubf-java-client]]

Using the UBF Client Library for Java

The source code for the UBF client library for Java is included in the UBF source repository at link:http://github.com/ubf/ubf[http://github.com/ubf/ubf], in the priv/java subdirectory.

Compiling the UBF client library for Java
  1. Please update your UBF client library code to the “master” branch for a date after 10 May 2010, or use the Git tag “v1.14” or later. Versions of the library before 10 May 2010 and tag “v1.14” have several bugs that will prevent the UBF client from working correctly.
  2. Change directory to the priv/java directory of the UBF client library source distribution.
  3. Run make.
  4. (Optional) Copy the class files in the classes subdirectory to a suitable directory for your Java development environment.
Compiling the UBF client library test program HibariTest.java
  1. Change directory to the gdss-ubf-proto/priv/java subdirectory in the Hibari source distribution.

  2. Edit the Makefile to change the UBF_CLASSES_DIR variable to point to the priv/java/classes subdirectory of the UBF package’s source code (or the subdirectory where those classes have been formally installed on your system).

  3. Run the following two make commands. The second assumes that the Hibari server’s UBF server is on the local machine, “localhost”:

    $ make HibariTest
    $ make run-HibariTest
    
  4. If the Hibari server is not running on the local machine, then run make -n run-HibariTest to show the java command that is used to run the test program. Cut-and-paste the command into your shell, then edit the last argument to specify the hostname of a Hibari server.

Examining the HibariTest.java test program

The main() function does three things:

  1. Create a new UBF connection to a Hibari server (hostname/IP address is specified in the first command line argument) and requests the gdss contract via the UBF metaprotocol.
  2. Run the small test cases in the test_hibari_basics() method.
  3. Close the UBF session and exit.
The ubf.HibariTest.main() method
public class HibariTest {

    public static void main(String[] args) throws Exception {
        Socket sock = null;
        UBFClient ubf = null;

        try {
            sock = new Socket(args[0], 7581);
            ubf = UBFClient.new_via_sock(new UBFString("gdss"), new UBFList(),
                    new FooHandler(), sock);
        } catch (Exception e) {
            System.out.println(e);
            System.exit(1);
        }

        test_hibari_basics(ubf);

        ubf.stopSession();
        System.out.println("Success, it works");
        System.exit(0);
    }
    /* ... */
 }

The test_hibari_basics() method performs the same basic UBF operations as the Python EBF demonstration script described in xref:using-ubf-python-client[]. Unlike the Python demo script, this demo program does not use the Hibari do() command but rather the single-operation commands like get() and set().

  1. Delete the key foo from table tab1:

    public static void test_hibari_basics(UBFClient ubf)
            throws IOException, UBFException {
        // setup
        UBFObject res1 = ubf.rpc(
               UBF.tuple( new UBFAtom("delete"), new UBFAtom("tab1"),
                          new UBFBinary("foo"), new UBFList(),
                          new UBFInteger(4000)));
        System.out.println("Res 1:" + res1.toString());
    
  2. Add the key foo to table tab1:

    // add - ok
    UBFObject res2 = ubf.rpc(
            UBF.tuple( new UBFAtom("add"), atom_tab1,
                        new UBFBinary("foo"), new UBFBinary("bar"),
                        new UBFInteger(0), new UBFList(),
                        new UBFInteger(4000)));
    System.out.println("Res 2:" + res2.toString());
    if (! res2.equals(atom_ok))
        System.exit(1);
    
  3. Add the key foo to table tab1 again, this time expecting a failure:

    // add - ng
    UBFObject res3 = ubf.rpc(
            UBF.tuple( new UBFAtom("add"), atom_tab1,
                       new UBFBinary("foo"), new UBFBinary("bar"),
                       new UBFInteger(0), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 3:" + res3.toString());
    if (! ((UBFTuple)res3).value[0].equals(atom_key_exists))
        System.exit(1);
    
  4. Get the key foo from table tab1:

    // get - ok
    UBFObject res4 = ubf.rpc(
            UBF.tuple( new UBFAtom("get"), atom_tab1,
                       new UBFBinary("foo"), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 4:" + res4.toString());
    if (! ((UBFTuple)res4).value[0].equals(atom_ok) ||
        ! ((UBFTuple)res4).value[2].equals("bar"))
        System.exit(1);
    
  5. Set the key foo in table tab1 to bar bar:

    // set - ok
    UBFObject res5 = ubf.rpc(
            UBF.tuple( new UBFAtom("set"), atom_tab1,
                       new UBFBinary("foo"), new UBFBinary("bar bar"),
                       new UBFInteger(0), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 5:" + res5.toString());
    if (! res5.equals(atom_ok))
        System.exit(1);
    
  6. Get foo again and verify that the value is bar bar:

    // get - ok
    UBFObject res6 = ubf.rpc(
            UBF.tuple( new UBFAtom("get"), atom_tab1,
                       new UBFBinary("foo"), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 6:" + res6.toString());
    if (! ((UBFTuple)res6).value[0].equals(atom_ok) ||
        ! ((UBFTuple)res6).value[2].equals("bar bar"))
        System.exit(1);
    
The UBF event handler interface

Each UBFClient instance uses a separate thread to read data from the server and do any of the following:

  1. Signal to the other thread that a synchronous RPC response was received from the server.
  2. Run a callback function when an event_out asynchronous event is received from the server.
  3. Run a callback function when the socket is closed unexpectedly.

In cases #2 and #3, a class that implements the UBFEventHandler interface is used to define the action to be taken in those cases.

The HibariTest.java contains a sample implementation of callback functions for asynchronous events. A real application would probably want to do something much more helpful than this example does.

public static class FooHandler implements UBFEventHandler {
    public FooHandler() {
    }
    public void handleEvent(UBFClient client, UBFObject event) {
        System.out.println("Hey, got an event: " + event.toString());
    }
    public void connectionClosed(UBFClient client) {
        System.out.println("Hey, connection closed, ignoring it\n");
    }
}

Tip

See xref:the-ubf-hibaritest-main-method[] for an example that uses this FooHandler class.

[[using-ubf-python-client]]

Using the EBF Client Library for Python

The source code for the EBF client library for Python is included in the UBF source repository at link:http://github.com/ubf/ubf[http://github.com/ubf/ubf], in the priv/python subdirectory.

NOTE: Recall that the EBF protocol is very closely related to UBF. The only significant difference is the “layer 5” session protocol layer: instead of the UBF(A) protocol, the EBF (Erlang Binary Format) protocol is used. See xref:hibari-server-impl-of-ubf-proto-stack[] for more details.

In addition, you will need the “py_interface” package, developed by Tomas Abrahamsson and others. “py_interface” is distributed under the link:http://www.fsf.org/licensing/education/licenses/lgpl.html[GNU Library General Public License]. A Git repository is hosted at repo.or.cz. To clone and build it, use:

$ git clone git://repo.or.cz/py_interface.git
$ cd py_interface
$ autoconf
$ ./configure
$ make
$ pwd

Use the output of the last command, pwd, to note the full directory path to the “py_interface” library. The example below assumes that path is /path/to/py_interface.

The pyebf.py file contains a small unit test that makes several calls to the Hibari UBF contract’s do_req() command. The results of (almost) every command are verified using the assert function.

env PYTHONPATH=/path/to/py_interface python pyebf.py
  1. Connect to the Hibari server on “localhost” TCP port 7580 and use the UBF metaprotocol to switch to the gdss contract:

    ## login
    ebf.login('gdss', 'gdss_meta_server')
    
  2. Delete the key 'foo' from table tab1:

    ## setup
    req0 = (Atom('do'), Atom('tab1'), [(Atom('delete'), 'foo', [])], [], 1000)
    res0 = ebf.rpc('gdss', req0)
    
  3. Get the key 'foo' from table tab1:

    ## get - ng
    req1 = (Atom('do'), Atom('tab1'), [(Atom('get'), 'foo', [])], [], 1000)
    res1 = ebf.rpc('gdss', req1)
    assert res1[0] == 'key_not_exist'
    
  4. Add the key 'foo' to table tab1. The do_req() interface requires the client to manage the timestamp integers explicitly; the timestamp 1 is used here:

    ## add - ok
    req2 = (Atom('do'), Atom('tab1'),
            [(Atom('add'), 'foo', 1, 'bar', 0, [])], [], 1000)
    res2 = ebf.rpc('gdss', req2)
    assert res2[0] == 'ok'
    
  5. Add the key 'foo' to table tab1 again, this time expecting a failure (key_exists):

    ## add - ng
    req3 = (Atom('do'), Atom('tab1'),
            [(Atom('add'), 'foo', 1, 'bar', 0, [])], [], 1000)
    res3 = ebf.rpc('gdss', req3)
    assert res3[0][0] == 'key_exists'
    assert res3[0][1] == 1
    
  6. Get the key 'foo' from table tab1, verifying that the timestamp is still 1 and value is still 'bar':

    ## get - ok
    req4 = (Atom('do'), Atom('tab1'), [(Atom('get'), 'foo', [])], [], 1000)
    res4 = ebf.rpc('gdss', req4)
    assert res4[0][0] == 'ok'
    assert res4[0][1] == 1
    assert res4[0][2] == 'bar'
    
  7. Set the key 'foo' in table tab1, using a new timestamp 2:

    ## set - ok
    req5 = (Atom('do'), Atom('tab1'),
            [(Atom('set'), 'foo', 2, 'baz', 0, [])], [], 1000)
    res5 = ebf.rpc('gdss', req5)
    assert res5[0] == 'ok'
    
  8. Get the key 'foo' from table tab1, verifying both the new timestamp and new value:

    ## get - ok
    req6 = (Atom('do'), Atom('tab1'), [(Atom('get'), 'foo', [])], [], 1000)
    res6 = ebf.rpc('gdss', req6)
    assert res6[0][0] == 'ok'
    assert res6[0][1] == 2
    assert res6[0][2] == 'baz'
    
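Because the EBF and UBF transports both speak the same “gdss” contract, the do() requests above should translate directly to the UBF Erlang client shown earlier. A hedged sketch, assuming the connection P1 from the UBF walkthrough is still open and the key <<"foo">> exists:

%% Same do() request structure as the Python example, sent over UBF from Erlang.
%% P1 is assumed to be a connection that has already selected the "gdss" contract.
{reply, DoRes, none} =
    ubf_client:rpc(P1, {do, tab1, [{get, <<"foo">>, []}], [], 1000}),
%% On success, DoRes should be a list with one result per operation in the do() list.
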
Client API: Thrift

“TBF” is a link:https://github.com/apache/thrift[Thrift protocol] defined by the UBF contract xref:the-hibari-ubf-protocol-contract[]. This section describes the Hibari Thrift API, which allows users to access Hibari with Thrift clients in any Thrift-supported programming language, and explains how to extend the API for application use.

The Hibari Thrift API

The Hibari Thrift API is defined as the Hibari service in link:./misc-codes/hibari.thrift[hibari.thrift]. At the time this API was developed, only Thrift 0.4.0 was available to us, and this version is our first attempt to adopt Thrift; some functions and options are not yet supported.

Important

The Hibari Thrift API only supports Thrift 0.4.0 or above.

service Hibari {

   /**
    * Check connection availability / keepalive
    */
   oneway void keepalive()

   /**
    * Hibari Server Info
    */
   string info()

   /**
    * Hibari Description
    */
   string description()

   /**
    * Hibari Contract
    */
   string contract()

   /**
    * Add
    */
   HibariResponse Add(1: Add request)
       throws (1:HibariException ouch)

   /**
    * Replace
    */
   HibariResponse Replace(1: Replace request)
       throws (1:HibariException ouch)

   /**
    * Set
    */
   HibariResponse Set(1: Set request)
       throws (1:HibariException ouch)

   /**
    * Rename
    */
   HibariResponse Rename(1: Rename request)
       throws (1:HibariException ouch)

   /**
    * Delete
    */
   HibariResponse Delete(1: Delete request)
       throws (1:HibariException ouch)

   /**
    * Get
    */
   HibariResponse Get(1: Get request)
       throws (1:HibariException ouch)
   }

Each primitive utility function takes exactly one input parameter. The parameter is an object whose name matches the function name, and it carries all mandatory and optional parameters to Hibari. This object could also be used to implement micro-transactions in the future.

Mapping UBF Contract Types to Thrift Types

You can find more details of the UBF / Thrift type conversion in link:https://github.com/ubf/ubf-thrift[UBF-Thrift].

Mapping UBF Contract to Thrift Service

Mapping UBF types to Thrift primitives is different from mapping a UBF contract to a Thrift service. Thrift mainly uses two different types to compose a request (struct and field).

If you are using Thrift to generate client code, you probably don’t need to worry about how the request is constructed. Visit the link:http://wiki.apache.org/thrift/ThriftGeneration[Thrift Wiki] for instructions on installing Thrift and generating client code. You will also need link:./misc-codes/hibari.thrift[hibari.thrift] to get started.

If you are interested in the UBF contract, the Hibari NTBF contract can be found in the file ntbf_gdss_plugin.con.

Examples of using a Thrift client

Once you get the generated code, connecting to Hibari is easy. The following examples add the key 'fookey' to table tab1 with a value of 'Hello, world!' in three languages.

In Erlang:

-include("hibari_thrift.hrl").

% init
{ok, Client} = thrift_client:start_link("127.0.0.1", 7600, hibari_thrift),

% create the input parameter object
Request = #add{table=<<"tab1">>, key=<<"fookey">>, value=<<"Hello, world!">>},

% send request
try
  HibariResponse = thrift_client:call(Client, 'Add', [Request])
catch
  HibariException ->
    HibariException
end,

ok = thrift_client:close(Client).

In Java:

import com.hibari.rpc.*;

// init
TTransport transport = new TSocket("127.0.0.1", 7600);
TProtocol proto = new TBinaryProtocol(transport);
Hibari.Client client = new Hibari.Client(proto);
transport.open();

// create the input parameter object
Add request = new Add("tab1", ByteBuffer.wrap("fookey".getBytes()),
  ByteBuffer.wrap("Hello, world!".getBytes()));

// send request
try {
  HibariResponse response = client.Add(request);
} catch (HibariException e) {
  // ...
}

transport.close();

In python:

from hibari import Hibari

# init
transport = TSocket.TSocket('localhost', 7600)
transport.setTimeout(None)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hibari.Client(protocol)
transport.open()

# create the input parameter object
request = Add()
request.table = "tab1"
request.key = b"fookey"
request.value = b"Hello, world!"

# send request
response = client.Add(request)

transport.close()
Mapping TBF Contract Responses From Thrift Client

TBF returns only one of two generic types for every function in the Hibari Thrift API: HibariResponse or HibariException. One can expect a HibariResponse in any successful case; otherwise a HibariException is thrown.

Building Hibari from Source

This section describes the basic recipes to build the following items:

  • Hibari Release Package
  • Hibari Documentation
  • Erlang/OTP System
Required Third Party Software

Before getting started, review this checklist of tools and software. Please install and set up as needed.

Mandatory Items (Required for Building Hibari)

The following software is required in order to download Hibari and build a release package:

  • Git – http://git-scm.com/

    • Must be version 1.5.4 or newer.

      • 1.7.3.4 is the version most recently tested for Hibari.
    • If you haven’t yet done so, please configure your email address and name for Git:

      $ git config --global user.email "you@example.com"
      $ git config --global user.name "Your Name"
      
    • If you haven’t yet done so, you must sign up for a GitHub account – https://github.com/

      • Anonymous read-only access using the Git protocol is the default.
      • Team members with read-write access: be sure to add your SSH public key under your GitHub account.
  • Python – http://www.python.org

    • Required by Repo

    • Must be version 2.4 or newer

      • 2.7 is the version most recently tested for Hibari.

      Caution

      Python 3.x might be too new.

  • Repo – http://source.android.com/source/git-repo.html

    • Install as follows:

      $ mkdir -p ~/bin
      $ curl http://commondatastorage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
      $ chmod a+x ~/bin/repo
      
    • The downloading and packaging process also uses Rebar (https://github.com/basho/rebar/wiki) but this tool is included in the Hibari Git repositories so you do not need to install it separately.

  • OpenSSL – http://www.openssl.org/

    • Required for Erlang’s crypto module.
  • Erlang/OTP – http://www.erlang.org/

    • Must be version R16B01 or newer.
      • 17.4 is the version most recently tested for Hibari.
    • For information on building Erlang/OTP from source, see <<ErlangOTP>> in this document.
Optional Items (Required for Building Hibari’s Documentation)

The following software is required only if you want to build Hibari’s documentation from source. Note that an online version of the documentation is available at http://hibari.github.com/hibari-doc/.

Downloading Hibari

Follow these steps to download the Hibari repositories from GitHub.

  1. Create a working directory and retrieve the Hibari manifest files:

    $ mkdir working-directory
    $ cd working-directory
    $ repo init -u git://github.com/hibari/manifests.git -m hibari-default.xml
    

    Note

    Your “Git” identity is needed during the repo init step. Please enter the name and email of your GitHub account if you have one. Team members having read-write access should use repo init -u git@github.com:hibari/manifests.git -m hibari-default-rw.xml.

    Tip

    If you want to check out the latest development version of Hibari, please append -b dev to the repo init command.

  2. Download Hibari’s Git repositories:

    $ repo sync
    

    After the repo sync, your working directory has the following structure:

    <working-directory>
     |- hibari/
       |- .git/
       |- .gitignore
       |- Makefile
       |- dialyze-ignore-warnings.txt
       |- dialyze-nospec-ignore-warnings.txt
       |- lib/                             <1>
         |- <application_name>/
           |- .git/
           |- .gitignore
           |- ebin/
           |- include/
             |- *.hrl
           |- priv/
           |- rebar.config
           |- src/
             |- <application_name>.app.src
             |- *.erl
           |- test/
             |- eunit/
               |- *.erl
             |- eqc/
               |- *.erl
         :
       |- rebar
       |- rebar.config
       |- rel/                             <2>
         |- files/
           |- app.config
           |- erl
           |- hibari
           |- hibari-admin
           |- nodetool
           |- nodetool-admin
           |- vm.args
         |- hibari/
           :
           |- releases/
             |- <release_vsn>/
               :
             :
           :
         |- reltool.config
     |- hibari-doc/                        <3>
       :
     |- manifests/                         <4>
       :
     |- patches/                           <5>
       :
     |- rebar/                             <6>
       :
     |- .repo/
       :
    

<1> Applications <2> Releases <3> Documentation <4> Manifests <5> Patches <6> Rebar

Building the Hibari Release Package

Follow these steps to build a Hibari release package.

  1. Building basic recipe:

    $ cd working-directory/hibari
    $ make
    

Tip

If the response is “make: erl: Command not found”, please make sure Erlang/OTP is installed and “otp-installing-directory-name/bin” is added to your $PATH environment.

  2. Release packaging basic recipe:

    $ cd working-directory/hibari
    $ make package
    

Note

A release package tarball “hibari-X.Y.Z-dev-ARCH-WORDSIZE.tgz” and an md5sum file “hibari-X.Y.Z-dev-ARCH-WORDSIZE-md5sum.txt” are written into your working-directory. You can then use these files to perform a single-node or multi-node Hibari installation as described in <<getting-started>>.

[[HibariAsciiDoc]]

Building Hibari’s Documentation

Follow these steps to build Hibari’s documentation.

  1. Building Hibari’s “Guides” basic recipe:

    $ cd working-directory/hibari-doc/src/hibari
    $ make clean -OR- make realclean
    $ make
    
  2. Building Hibari’s “Website” basic recipe:

    $ cd working-directory/hibari-doc/src/hibari/website
    $ make clean -OR- make realclean
    $ make
    

Note

HTML documentation is written in the “./public_html” directory.

Hibari’s documentation is authored using AsciiDoc and a few auxiliary tools:

  • ImageMagick
  • dblatex
  • Dia
  • Graphviz
  • Mscgen
  • w3m

Hibari’s documentation is generated with AsciiDoc and a manually modified version of the a2x tool. A fake lang-ja.conf file can be easily created by making a symlink to the lang-en.conf file (an example follows the patch below).

diff -r -u 8.6.4-orig/bin/a2x.py 8.6.4/bin/a2x.py
--- 8.6.4-orig/bin/a2x.py    2011-04-24 00:50:26.000000000 +0900
+++ 8.6.4/bin/a2x.py 2011-04-24 00:35:55.000000000 +0900
@@ -156,7 +156,10 @@
  def shell_copy(src, dst):
    verbose('copying "%s" to "%s"' % (src,dst))
      if not OPTIONS.dry_run:
-        shutil.copy(src, dst)
+        try:
+            shutil.copy(src, dst)
+        except shutil.Error:
+            return

  def shell_rm(path):
      if not os.path.exists(path):
 Only in 8.6.4/etc/asciidoc: lang-ja.conf
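
The fake lang-ja.conf mentioned above can be created with a symlink, for example (the 8.6.4/etc/asciidoc path is taken from the diff above; adjust it for your AsciiDoc installation):

$ cd 8.6.4/etc/asciidoc
$ ln -s lang-en.conf lang-ja.conf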

[[ErlangOTP]]

Building and Installing Erlang/OTP

Follow these steps to download and build Erlang/OTP from source, and to install the system. These steps provide a basic recipe; not all options are addressed.

Note

Please make sure to have the ‘openssl-devel’ package installed on your system before configuring and building Erlang/OTP.

  1. Download the source code for your Erlang/OTP system:

    $ cd working-directory
    $ wget http://www.erlang.org/download/otp_src_R16B01.tar.gz
    
  2. Untar the source code for your Erlang/OTP system:

    $ tar -xzf otp_src_R16B01.tar.gz
    
  3. Configure Erlang/OTP:

    $ cd working-directory/otp_src_R16B01
    $ ./configure --prefix=otp-installing-directory-name
    
  4. Build Erlang/OTP:

    $ make
    
  5. Install Erlang/OTP:

    $ sudo make install
    

Caution

Please make sure “otp-installing-directory-name/bin” is added to your $PATH environment.

Contributing to Hibari
GitHub, Git, and Repo

to be added

List the working directories for all of Hibari’s “projects”:

$ repo forall -c "pwd"

Note

Each project has a corresponding Git repository and (default) revision. Check the “manifests/hibari-default.xml” file for details.

Start a new topic (e.g. new-topic-name) branch:

$ repo start new-topic-name `repo forall -c "pwd" | xargs echo`

Abandon an existing topic (e.g. topic-name) branch:

$ repo abandon topic-name `repo forall -c "pwd" | xargs echo`

Track and checkout the master branch:

$ repo forall -c "git branch --track master github/master"
$ repo forall -c "git checkout master"

Track and checkout the dev (i.e. Development) branch:

$ repo forall -c "git branch --track dev github/dev"
$ repo forall -c "git checkout dev"
Code, Branch, and Version Management

to be added

Documentation

to be added

Submitting Patches

to be added

Introduction

Hibari is a production-ready, distributed, key-value, big data store. In the emerging field of “NOSQL” solutions to today’s mass-scale data storage challenges, Hibari stands out for several reasons:

  • Hibari is the only open source key-value database to couple Erlang engineering with innovative chain replication technology. Erlang is the ideal programming foundation on which to build a robust, high-performance distributed storage solution. Chain replication delivers high throughput and availability without sacrificing data consistency.
  • Hibari is the only open source key-value database built to the exacting standards of the carrier-class telecom sector, and proven in multi-million user telecom production environments.
  • Hibari delivers a distinctive feature matrix that includes:
    • Per-table options for RAM+disk-based or disk-only value storage
    • Support for per-key expiration times and per-key custom meta-data
    • Support for multi-key atomic transactions, within range limits
    • A key timestamping mechanism that facilitates “test-and-set” type operations
    • Automatic data rebalancing as the system scales
    • Support for live code upgrades
    • Multiple client API implementations

This introductory chapter will briefly address the recent emergence of NOSQL solutions to the challenges posed by the “Big Data” era before turning to describe more fully the distinctive benefits that Hibari provides to developers, administrators, and users of data-intensive applications.

Why NOSQL?

The NOSQL “movement” is, first off, not an outright rejection of traditional relational database management systems (RDBMS) but rather a growing recognition that today’s data environment requires a diverse storage toolset that is “Not Only SQL (NOSQL)”. Relational and NOSQL data storage solutions should be viewed as complements, with each approach better suited toward different types of applications and services.

The main driver of NOSQL has been the proliferation of applications and services that must store and serve terabytes or petabytes of data, often while striving to guarantee “always-on” availability and low latencies for end users. Organizations in many market sectors are grappling with the advent of Big Data, including but not limited to:

  • Web properties – coping with the massive data requirements of search, e-commerce, social media, and user-generated content.
  • Telecoms – managing and analyzing network logs and call data records for multi-millions of subscribers.
  • Utilities – managing and analyzing the enormous data volume associated with smart grids.
  • Financial services – storing and mining customer history data in order to analyze and model risk.
  • Retail analytics – click-stream analysis and micro-targeting.
  • Biotech – genome analysis.

Organizations in these and other data-intensive environments have been challenged to build data storage systems of unprecedented scale. Many such organizations have found their needs ill-met by traditional data storage approaches that center around relational database management systems and specialized high-end hardware. In particular:

  • Scaling up a single RDBMS instance doesn’t achieve nearly the scale required, no matter how high-end the systems or how great the expenditure.
  • Scaling out by sharding the system over multiple RDBMS instances entails enormous costs and enormous operational complexity, while at the same time forfeiting much of the power of the relational model.

Wanting Big Data capacity without crippling cost and complexity, some innovative organizations have sought a better way to scale. At the same time, with an ever-expanding array of data usage scenarios, it’s become apparent that not all scenarios require the complex querying and management functionality associated with an RDBMS. For some applications and services, SQL-structuring and strict ACID properties are overkill. Worse, in some environments they’re expensive overkill that can potentially hamstring service offerings in highly competitive markets that demand flexibility and responsiveness.

In short, recent years have seen a proliferation of services that require more data, with less structure.

Not surprisingly, some of the leading web enterprises have been at the forefront of the NOSQL movement. In particular, Google with its http://labs.google.com/papers/bigtable.html[BigTable paper] in 2006 and Amazon with its http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf[Dynamo paper] in 2007 had a profound effect on the NOSQL market. A number of NOSQL solutions have drawn inspiration from either BigTable or Dynamo or both, and in the past couple of years several solutions have been released into the open source community.

While NOSQL data storage solutions vary in their particulars, they have these basic traits in common:

  • A simplified data model. Data models vary across specific solutions, and sometimes form the basis of a tripartite classification of NOSQL systems into 1) key-value data stores (such as Dynamo and Hibari); 2) column-oriented data stores (such as BigTable); and 3) document-oriented data stores (such as CouchDB). All variants, however, are simpler and more flexible in data model than the traditional RDBMS. That simplification tends to carry over to client APIs as well.
  • Distribution across multiple nodes based on commodity PCs. Affordable Big Data capacity is achieved by scaling out across tens, hundreds, or even thousands of commodity PCs. Data partitioning schemes coupled with parallel processing of incoming requests deliver the needed high performance.
  • Replication of data objects across multiple nodes, to ensure high availability in the event of component failures.

For much more on the history, merits, and design issues associated with NOSQL storage solutions, consult with your favorite search engine.

Why Hibari?

Hibari was developed internally by Cloudian, Inc. (formerly Gemini Mobile Technologies), a leading producer of mass-scale messaging and transaction systems for Tier 1 mobile operators in Asia, Europe, and the Americas. Cloudian had need for a data store that was efficient, fast, flexible, and scalable, as well as robust enough to withstand the rigors of deployment in Tier 1 telecom production environments. Dissatisfied with the then-available options, Cloudian in 2005 began work on what came to be Hibari (the name is Japanese for skylark; the kanji characters stand for “cloud bird”).

With the system having in recent years matured and been proven in production, Cloudian released Hibari to the open source community in July 2010 under the Apache 2.0 license. Cloudian regards the open source community as the best venue in which Hibari can continue to perfect and grow.

This section describes some of the distinctive features that make Hibari a very attractive option for businesses and developers seeking a modern Big Data storage system:

  • link:#engineered-erlang[Engineered in Erlang]
  • link:#chain-replication[Chain Replication for High Availability and Strong Consistency]
  • link:#scalability[Easy, Affordable Scalability]
  • link:#high-performance[High Performance, Especially for Reads and Large Values]
  • link:#simple-powerful-api[Simple But Powerful Client API]
  • link:#production-proven[Production-Proven]
  • link:#hibari-benefits-by-user[Hibari Benefits for Developers, System Administrators, and Businesses]

[[engineered-erlang]]

Engineered in Erlang

Erlang is a general purpose programming language and runtime environment designed specifically to support reliable, high-performance distributed systems. Originally developed by Ericsson in the 1980s for building advanced telecom networking systems, Erlang/OTP (Open Telecom Platform) was open-sourced in 1998. Hibari is written entirely in Erlang.

Erlang provides a range of benefits that make it the ideal foundation for a distributed key-value storage solution:

  • Concurrency. Erlang has extremely lightweight processes that communicate by message passing and have no shared memory. Scheduling, memory management, and other concurrency-related services are managed by the Erlang VM, placing no requirements for concurrency on the host operating system.
  • Distribution. Erlang is designed specifically for distributed environments. Passing messages transparently via TCP, Erlang processes on different nodes communicate with each other in exactly the same way as do processes on the same node. The simple and efficient design facilitates massive parallelism and scalability of the sort required by a high-performance distributed storage system. With its prowess for concurrency and distributed processing, it has been suggested that Erlang can be regarded as a first-of-its-kind http://www.oreillygmt.eu/open-sourcefree-software/erlang-the-ceos-view/[“application system”], analogous to an operating system except running across and coordinating multiple hosts.
  • Robustness. Erlang processes are completely independent of each other, with no data sharing. While functionally isolated, Erlang processes are able to monitor each other and to detect and respond to crashed processes, even on remote nodes.
  • Portability. The same Erlang VM can run on Linux, Unix, Windows, Macintosh, or VxWorks. Distributed Erlang processes can seamlessly communicate with each other regardless of the heterogeneity of their host operating systems. This OS portability is a valuable facilitator of storage system elasticity, as system managers may need to mix and match hosts in response to fluid demand environments.
  • Hot code upgrades. Erlang-based applications like Hibari support hot code upgrades: upgrades can be applied without shutting down the system. During the change-over, old and new code can run simultaneously. This is a key benefit for environments that require “always-on” availability for end users.

Other features reinforce Erlang’s suitability for reliable distributed applications, including incremental garbage collection, single-assignment variables, and robust exception handling.

[[chain-replication]]

Chain Replication for High Availability and Strong Consistency

The Hibari distributed key-value store implements a version of the chain replication methodology first proposed by http://www.usenix.org/event/osdi04/tech/full_papers/renesse/renesse.pdf[van Renesse and Schneider] to achieve redundancy and high availability without sacrificing data consistency. At a high level, chain replication in a Hibari storage cluster works as follows:

  • Through consistent hashing, the key space is divided across multiple storage “chains”.
  • Each chain is composed of multiple logical storage “bricks”, with each brick running in its own Erlang VM instance.
  • Within each chain, the member bricks have differentiated roles. Client-requested updates to key-value pairs are written first to the “head” brick, then automatically replicated downstream to one or more “middle” bricks and finally to the “tail” brick, which returns an update acknowledgement to the client. By contrast, read requests are directed to the tail brick, which returns the response to the client.

image:images/chain_replication.png[]

While most distributed storage systems are able to guarantee only weak or eventual data consistency across replicas – placing the burden on the client application (and the client application developer) to manage the potential inconsistencies – Hibari with its chain replication implementation guarantees strong consistency. Data updates are considered complete, and are acknowledged to clients, only when they have replicated through the chain to the tail; and read requests are processed only by the tail. Consequently, after an object update is acknowledged to a Hibari client, other clients are guaranteed to see only the newest version of that object. This strong consistency is valuable in environments where ‘eventual consistency’ is at odds with the service level expected by end users, or where system designers do not want to clutter client applications with the logic required to manage data inconsistency.

The “length” of a chain is configurable and can be based on your desired degree of replication and redundancy. For example, a chain of length four would have a head brick, two middle bricks, and a tail brick; while a three-brick chain would have a head, one middle, and a tail. A chain can also operate at length two (a head and tail, with no middle) and even at length one (one brick playing both the head role and the tail role).
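
The write and read flow described above can be illustrated with a small, self-contained Erlang sketch. This is purely illustrative and is not Hibari’s implementation: real Hibari bricks are full Erlang VM instances with on-disk logs, failure detection, and automatic role reassignment. In the sketch, each “brick” is a process holding an in-memory map; writes enter at the head and are forwarded down the chain, the tail acknowledges writes and serves all reads, and a chain of length one acts as both head and tail.

%% Illustrative sketch of chain replication message flow (not Hibari's code).
-module(chain_sketch).
-export([start/1, write/3, read/2]).

%% Build a chain of Len bricks; returns {HeadPid, TailPid}.
start(Len) when Len >= 1 ->
    Bricks = build(Len, undefined, []),
    {hd(Bricks), lists:last(Bricks)}.

%% Spawn bricks from tail to head so each brick knows its downstream successor.
build(0, _Next, Acc) ->
    Acc;
build(N, Next, Acc) ->
    Pid = spawn(fun() -> brick(#{}, Next) end),
    build(N - 1, Pid, [Pid | Acc]).

%% Clients send writes to the head and wait for the tail's acknowledgement.
write(Head, Key, Val) ->
    Head ! {write, Key, Val, self()},
    receive {ack, Key} -> ok end.

%% Clients send reads directly to the tail.
read(Tail, Key) ->
    Tail ! {read, Key, self()},
    receive {value, Key, Val} -> Val end.

%% A brick stores key-value pairs; it forwards writes downstream (head/middle)
%% or acknowledges them to the client (tail), and answers reads from its store.
brick(Store, Next) ->
    receive
        {write, Key, Val, Client} ->
            NewStore = maps:put(Key, Val, Store),
            case Next of
                undefined -> Client ! {ack, Key};              % tail: ack the write
                _         -> Next ! {write, Key, Val, Client}  % forward downstream
            end,
            brick(NewStore, Next);
        {read, Key, Client} ->
            Client ! {value, Key, maps:get(Key, Store, undefined)},
            brick(Store, Next)
    end.

For example, {Head, Tail} = chain_sketch:start(3), then ok = chain_sketch:write(Head, <<"k">>, <<"v">>), and chain_sketch:read(Tail, <<"k">>) returns <<"v">>.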

Because chains can operate at any length, and because the system is able to detect failures within the chain and to adjust member brick roles accordingly, Hibari delivers high availability as well as strong data consistency. For example, if in a three-brick chain the head brick goes down, the middle brick automatically takes over the head brick role, allowing the chain to continue functioning normally:

image:images/automatic_failover.png[]

If the new head brick failed also, the lone remaining brick would then play both the head role and the tail role, processing all writes and reads itself as a single-brick “chain”.

While multiple logical bricks can run on a single physical node, for high availability it is of course desirable that a particular chain’s member bricks be deployed on separate machines. If you want to run multiple bricks per machine and also ensure high availability for each chain, an attractive deployment option is to “stripe” the chains across machines:

image:images/load_balanced_chains.png[]

Note also that because head bricks (receiving incoming write requests) and tail bricks (replying to write requests and processing read requests) bear more load than do middle bricks, load balancing across machines can be achieved in part by allocating the different brick roles evenly, as in the diagram above.

In the event of a physical node failure, bricks within each impacted chain automatically shift roles, and each chain continues to provide normal service to clients:

image:images/automatic_failover_2.png[]

For further information about chain replication, fail-over, and recovery in a Hibari storage system, and for information about Hibari’s redundantly structured cluster membership application called the Admin Server, see these sections of the Hibari System Administrator’s Guide:

  • link:hibari-sysadmin-guide.en.html#hibari-architecture[Hibari Architecture]
  • link:hibari-sysadmin-guide.en.html#life-of-brick[The Life of a (Logical) Brick]
  • link:hibari-sysadmin-guide.en.html#dynamic-cluster-reconfiguration[Dynamic Cluster Reconfiguration]
  • link:hibari-sysadmin-guide.en.html#admin-server-app[The Admin Server Application]

[[scalability]]

Easy, Affordable Scalability

Hibari provides Big Data scalability while minimizing the cost and operational complexity of cluster growth:

  • Hibari scales horizontally by the addition of more chains, deployed on more physical nodes. The total storage and processing capacity of a Hibari cluster increases linearly as machines are added to the cluster.
  • The system rebalances data storage distribution automatically as chains are added to (or removed from) the cluster, with no downtime. You can grow (or shrink) your Hibari storage cluster with no service interruption.
  • Hibari runs on commodity PCs. Further, the system easily accommodates heterogeneous hardware resources. Bricks within the storage cluster can have different RAM and disk sizes, and different CPU speeds. You can tune Hibari’s consistent hash function to optimize your cluster’s utilization of mixed hardware. Each chain can be assigned a weighting factor that will increase or decrease that chain’s portion of the overall key space, relative to other chains.

In addition to supporting mixed hardware, Erlang-based Hibari can run on most any OS. With its easy adaptability to disparate hardware and operating systems, you can scale Hibari incrementally, with whatever resources you have available. It’s not necessary to buy all your resources at once, or all of the same kind.

Note

The outer limits of Hibari’s horizontal scalability have not been definitively determined, but 200 to 250 nodes is a practical boundary due to the limitations of Erlang’s built-in network distribution implementation. Also, while Hibari chains could theoretically be stretched across multiple data centers to provide geographic redundancy, to date only single data center deployments have been tested and used in production.

For further information on resizing a Hibari cluster, see link:hibari-sysadmin-guide.en.html#dynamic-cluster-reconfiguration[Dynamic Cluster Reconfiguration] in the Hibari System Administrator’s Guide.

[[high-performance]]

High Performance, Especially for Reads and Large Values

Several features work in combination to drive high performance in a Hibari storage cluster, even at Big Data scale:

  • The Erlang technology that underlies Hibari was specifically designed for and excels at distributed parallel processing.
  • Hibari’s implementation of consistent hashing and chain replication partitions the key-space across multiple chains, enabling parallel simultaneous processing of requests incoming to individual chains. The distribution of data across chains is tunable to allow optimal utilization of heterogeneous hardware resources.
  • Hibari’s chain replication implementation further aids performance by assigning storage bricks differentiated processing roles as head, middle, or tail. This division of labor particularly benefits read performance, as read requests are processed by “tail” bricks that do not bear the load of initial processing of write requests (since that work is done by “head” bricks).
  • Hibari supports a number of performance-tuning options on a per-table basis. For example, while some distributed KVDBs support only disk-based storage or only RAM-based storage of value blobs, Hibari lets you choose RAM+disk-based or disk-only storage on a per-table basis, depending on your application needs. Whichever storage option you choose, all data changes are logged to disk to ensure data durability in the event of power failures. A batch commit technique is used to minimize disk I/O.

Leveraging this feature set, Hibari is able to deliver scalable high performance that is competitive with leading open source NOSQL storage systems, while also providing the data durability and strong consistency that many systems lack. Hibari’s performance relative to other NOSQL systems is particularly strong for reads and for large value (> 200KB) operations. Hibari’s consistently high performance even for large values distinguishes the system from solutions that are tailored toward small value operations.

As one example of real-world performance, in a multi-million user webmail deployment equipped with traditional HDDs (non SSDs), Hibari is processing about 2,200 transactions per second, with read latencies averaging between 1 and 20 milliseconds and write latencies averaging between 20 and 80 milliseconds.

[[simple-powerful-api]]

Simple But Powerful Client API

As a key-value store, Hibari’s core data model and client API model are simple by design: blob-based key-value pairs can be inserted, retrieved, and deleted from lexicographically sorted tables. While Hibari thus provides the flexibility and scalability associated with key-value stores, the system also provides distinctive features that enhance the power of client applications and developers:

  • Clients can optionally assign per-object expiration times.
  • Clients can optionally assign per-object custom flags. This flexible, custom meta-data can be updated with or without updating the associated value blob, and can be retrieved with or without the value blob.
  • Objects are automatically timestamped each time they are updated. This timestamping mechanism facilitates “test-and-set” type operations: clients can specify that a requested operation be performed only if the target key’s timestamp matches the client’s expectations.
  • Within key-prefix range limits (specifically, within individual chains but not across chains), Hibari’s client API supports atomic transactions. This support for “micro-transactions” sets Hibari apart from other open source KVDBs and can greatly simplify the creation of robust client applications.
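
As an illustration of the test-and-set mechanism, the following hedged Erlang sketch performs a timestamp-guarded update over the UBF client from the earlier walkthrough. It assumes connection P1 is still open, that the key already exists, and that a {testset, Timestamp} flag requests test-and-set behavior; see the client API chapters for the authoritative flag list and error terms.

%% Hedged sketch: timestamp-guarded update over the UBF client (connection P1).
%% The {testset, TS} flag is assumed; consult the client API chapters for the
%% authoritative flag names and error terms.
{reply, {ok, TS, OldVal}, none} =
    ubf_client:rpc(P1, {get, tab1, <<"foo">>, [], 5000}),
NewVal = <<OldVal/binary, "!">>,
case ubf_client:rpc(P1, {set, tab1, <<"foo">>, NewVal, 0, [{testset, TS}], 5000}) of
    {reply, ok, none}    -> ok;                % no concurrent update happened
    {reply, Error, none} -> {conflict, Error}  % timestamp changed; retry if desired
end.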

Hibari supports multiple client API implementations including:

  • Native Erlang
  • Universal Binary Format (UBF)
  • Thrift
  • Amazon S3
  • JSON-RPC

You can develop Hibari client applications in a variety of languages including Java, C/C++, Python, Ruby, and Erlang.

For further information about Hibari’s client API, see link:#client-api-erlang[Client API: Native Erlang] and the subsequent client API chapters in this guide.

Note

The Hibari source distribution does not include Amazon S3 and JSON-RPC. They are separate external projects.

[[production-proven]]

Production-Proven

While initial development work on Hibari was geared generally toward the data storage demands of the Tier 1 telecom sector, as the system evolved it needed to meet the requirements of a particular major Asian carrier that wished to launch a GB webmail service. This customer’s requirements for Hibari included the following:

  • Several million users from the start.
  • Several billion stored messages within a few months of launch.
  • Hundreds of TB storage capacity.
  • Elasticity to support continual growth.
  • Low system costs, particularly since the service would employ the “freemium” model.
  • Individual messages could range in size from a few bytes to many MB with attachments.
  • Support for per-object meta-data required.
  • Strong consistency required, for interactive sessions.
  • Data durability required – loss of messages or meta-data unacceptable.
  • High availability – an “always on”, branded service.
  • Low latency, with < 1 second response times for end user transactions.

Hibari was built to meet these rigorous requirements, was hardened through extensive testing and trials, and went live in support of this large-scale webmail system at the beginning of 2010. The system now stores billions of messages on behalf of millions of end users, while meeting customer requirements for availability, latency, consistency, durability, and affordability.

Coinciding with Hibari’s development and fine tuning for this GB webmail service, the system was also deployed as a storage solution for two major Asian carriers’ mobile social networking services. In this context, Hibari stores user profile data as well as digital goods of varying types and sizes.

[[hibari-benefits-by-user]]

Hibari Benefits for Developers, System Administrators, and Businesses

For application developers, Hibari offers a distinctive set of benefits not often found in distributed key-value stores:

  • Strong data consistency guarantees that relieve client applications of the burden of managing potential inconsistencies.
  • Micro-transaction support that simplifies the creation of powerful applications.
  • Per-object custom flags that facilitate flexible, service-specific application design.
  • Support for a variety of API implementations and development languages.

For system administrators, Hibari provides valuable operational automations that simplify data management in a dynamic storage environment:

  • Automatic data replication.
  • Automatic failover when a node goes down.
  • Automatic repair when a failed node comes back up.
  • Automatic rebalancing of data as a cluster grows or shrinks.

For businesses as a whole, Hibari offers affordable Big Data scalability while delivering the high availability and low latencies that service users demand. Hibari is an appropriate storage solution for a range of data-intensive service scenarios including but not limited to large-scale messaging, social media, and archiving. Hibari offers particular value in environments that require strong data consistency and/or high performance across a variety of object types and sizes.

Getting Started

This section covers the following topics to help you get up and running with Hibari:

  • link:#system-requirements[System Requirements]
  • link:#required-software[Required Third Party Software]
  • link:#download-hibari[Downloading Hibari]
  • link:#installing-single-node[Installing a Single-Node Hibari System]
  • link:#starting-single-node[Starting and Stopping a Single-Node Hibari System]
  • link:#installing-multi-node[Installing a Multi-Node Hibari Cluster]
  • link:#starting-multi-node[Starting and Stopping a Multi-Node Hibari Cluster]
  • link:#creating-tables[Creating New Tables]

[[system-requirements]]

System Requirements

Hibari will run on any OS that the Erlang VM supports, which includes most Unix and Unix-like systems, Windows, and Mac OS X. See Implementation and Ports of Erlang from the official Erlang documentation for further information.

For guidance on hardware requirements in a production environment, see link:hibari-sysadmin-guide.en.html#brick-hardware[Notes on Brick Hardware] in the Hibari System Administrator’s Guide.

[[required-software]]

Required Third-Party Software

Hibari’s requirements for third party software depend on whether you’re doing a single-node installation or a multi-node installation.

Required Software for a Single-Node Installation:

The node on which you plan to install Hibari must have the following software:

Required Software for a Multi-Node Installation:

When you install Hibari on multiple nodes you will use an installer tool that simplifies the cluster set-up process. When you use this tool you will identify the hosts on which you want Hibari to be installed, and the tool will manage the installation of Hibari onto those target hosts. You can run the tool itself from one of your target Hibari nodes or from a different machine. There are distinct requirements for third party software on the “installer node” (the machine from which you run the installer tool) and on the Hibari nodes (the machines on which Hibari will be installed and run.)

Installer Node Required Software

The installer node must have the software listed below. If you are missing any of these items, you can use the provided links for downloads and installation instructions.

There are currently no known version requirements for Bash, Expect, Perl, or SSH.

Hibari Nodes Required Software

The nodes on which you plan to install Hibari must have the software listed below.

[[download-hibari]]

Downloading Hibari

Hibari is not yet available as a pre-built release. In the meantime, you can build Hibari from source. Follow the instructions in <<HibariBuildingSource>>, and then return to this section to continue the set-up process.

When you build Hibari your output is two files that you will later use in the set-up process:

  • A tarball package hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
  • An md5sum file hibari-X.Y.Z-DIST-ARCH-WORDSIZE-md5sum.txt

X.Y.Z is the release version, DIST is the release distribution, ARCH is the release architecture, and WORDSIZE is the release wordsize.

[[installing-single-node]]

Installing a Single-Node Hibari System

A single-node Hibari system will not provide data replication and redundancy in the way that a multi-node Hibari cluster will. However, you may wish to deploy a simple single-node Hibari system for testing and development purposes.

  1. Create a directory for running Hibari:

    $ mkdir running-directory
    
  2. Untar the Hibari tarball package that you created when you built Hibari from source:

    $ tar -C running-directory -xvf hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
    

Important

On your Hibari node, in the system’s /etc/sysctl.conf file, set vm.swappiness=1. Swappiness is not desirable for an Erlang VM.

[[starting-single-node]]

Starting and Stopping Hibari on a Single Node

Starting and Bootstrapping Hibari
  1. Start Hibari:

    $ running-directory/hibari/bin/hibari start
    
  2. If this is the first time you’ve started Hibari, bootstrap the system:

    $ running-directory/hibari/bin/hibari-admin bootstrap
    

The Hibari bootstrap process starts Hibari’s Admin Server on the single node and creates a single table “tab1” serving as Hibari’s default table. For information on creating additional tables, see link:#creating-tables[Creating New Tables].

Verifying Hibari

Do these quick checks to verify that your single-node Hibari system is up and running.

  1. Confirm that you can open the “Hibari Web Administration” page:

    $ your-favorite-browser http://127.0.0.1:23080
    
  2. Confirm that you can successfully ping the Hibari node:

    $ running-directory/hibari/bin/hibari ping
    

IMPORTANT: A single-node Hibari system is hard-coded to listen on the localhost address 127.0.0.1. Consequently the Hibari node is reachable only from the node itself.

Stopping Hibari

To stop Hibari:

$ running-directory/hibari/bin/hibari stop

[[installing-multi-node]]

Installing a Multi-Node Hibari Cluster

Before you install Hibari on to the target nodes you must complete these preparation steps:

  • Set up required user privileges on the installer node and on the target Hibari nodes.
  • Download the Cluster installer tool.
  • Configure the Cluster installer tool.
Setting Up Your User Privileges

The system user ID that you use to perform the installation must be different from the Hibari runtime user. Your installing user account ($USER) must be set up as follows:

  • $USER must exist on the installer node and also on the target Hibari nodes.
  • $USER on the installer node must have SSH private/public keys, with the SSH agent set up to enable password-less SSH login.
  • $USER account must be accessible with password-less SSH login on the target Hibari nodes.
  • $USER must have password-less sudo access on the target Hibari nodes.

If your installing user account does not currently have the above privileges, follow these steps:

  1. As the root user, add your installing user ($USER) to the installer node. Then on each of the Hibari nodes, add your installing user and grant your user password-less sudo access:

    $ useradd $USER
    $ passwd $USER
    $ visudo
    # append the following line and save it
    $USER  ALL=(ALL)       NOPASSWD: ALL
    

Note

If you get a “sudo: sorry, you must have a tty to run sudo” error while testing sudo, try commenting out the following line in the /etc/sudoers file:

$ visudo
Defaults    requiretty
  2. On the installer node, create a new SSH private/public key for your installing user:

    $ ssh-keygen
    # enter your password for the private key
    $ eval `ssh-agent`
    $ ssh-add ~/.ssh/id_rsa
    # re-enter your password for the private key
    
  3. On each of the Hibari nodes:

  • Append an entry for the installer node to the ~/.ssh/known_hosts file.
  • Append an entry for your public SSH key to the ~/.ssh/authorized_keys file.

In the example below, the target Hibari nodes are dev1, dev2, and dev3:

$ ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@dev1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@dev2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub $USER@dev3

Note

If your installer node will be one of the Hibari cluster nodes, make sure that you ssh-copy-id to the installer node also.

  4. Confirm that password-less SSH access to each of the Hibari nodes works as expected:

    $ ssh $USER@dev1
    $ ssh $USER@dev2
    $ ssh $USER@dev3
    

Tip

If you need more help with SSH set-up, check http://inside.mines.edu/~gmurray/HowTo/sshNotes.html.

[[download-cluster]]

Downloading the Cluster Installer Tool

“Cluster” is a simple tool for installing, configuring, and bootstrapping a cluster of Hibari nodes. The tool is not part of the Hibari package itself, but is available from GitHub.

Note

The Cluster tool should meet the needs of most users. However, this tool’s “target node” recipe is currently Linux-centric (e.g. useradd, userdel, ...). Patches and contributions for other OSes and platforms are welcome. For non-Linux deployments, the Cluster tool is simple enough that installation can be done manually by following the tool’s recipe.

  1. Create a working directory into which you will download the Cluster installer tool:

    $ mkdir working-directory
    
  2. Download the Cluster tool’s Git repository from GitHub:

    $ cd working-directory
    $ git clone git://github.com/hibari/clus.git
    

The download creates a sub-directory clus under which the installer tool and various supporting files are stored.

[[config-cluster]]

Configuring the Cluster Installer Tool

The Cluster tool requires some basic configuration information that indicates how you want your Hibari cluster to be set up. You will create a simple text file that specifies your desired configuration, and then later use the file as input when you run the Cluster tool.

It’s simplest to create the file in the same working directory in which you downloaded the cluster tool. You can give the file any name that you want; for purposes of these instructions we will use the file name hibari.config.

Below is a sample hibari.config file. The file that you create must include all of these parameters, and the values must be formatted in the same way as in this example (with parentheses and quotation marks as shown). Parameter descriptions follow the example file.

ADMIN_NODES=(dev1 dev2 dev3)
BRICK_NODES=(dev1 dev2 dev3)
BRICKS_PER_CHAIN=2

ALL_NODES=(dev1 dev2 dev3)
ALL_NETA_ADDRS=("10.181.165.230" "10.181.165.231" "10.181.165.232")
ALL_NETB_ADDRS=("10.181.165.230" "10.181.165.231" "10.181.165.232")
ALL_NETA_BCAST="10.181.165.255"
ALL_NETB_BCAST="10.181.165.255"
ALL_NETA_TIEBREAKER="10.181.165.1"

ALL_HEART_UDP_PORT="63099"
ALL_HEART_XMIT_UDP_PORT="63100"

[[eligible-admin-nodes]]

  • ADMIN_NODES
    • Host names of the nodes that will be eligible to run the Hibari Admin Server. For complete information on the Admin Server, see link:hibari-sysadmin-guide.en.html#admin-server-app[The Admin Server Application] in the Hibari System Administrator’s Guide.
  • BRICK_NODES
    • Host names of the nodes that will serve as Hibari storage bricks. Note that in the sample configuration file above there are three storage brick nodes (dev1, dev2, and dev3), and these three nodes are each eligible to run the Admin Server.
  • BRICKS_PER_CHAIN
    • Number of bricks per replication chain. For example, with two bricks per chain there will be two copies of the data stored in the chain (one copy on each brick); with three bricks per chain there will be three copies, and so on. For an overview of chain replication, see link:#chain-replication[Chain Replication for High Availability and Strong Consistency] in this document. For chain replication detail, see the Hibari System Administrator’s Guide.
  • ALL_NODES
    • This list of all Hibari nodes is the union of ADMIN_NODES and BRICK_NODES.
  • ALL_NETA_ADDRS
    • As described in link:hibari-sysadmin-guide.en.html#partition-detector[The Partition Detector Application] in the Hibari System Administrator’s Guide, the nodes in a multi-node Hibari cluster should be connected by two networks, Network A and Network B, in order to detect and manage network partitions. The ALL_NETA_ADDRS parameter specifies the IP addresses of the Hibari nodes within Network A, which is the network through which data replication and other Erlang communications take place. The list of IP addresses should correspond, in order, to the host names you listed in the ALL_NODES setting.
  • ALL_NETB_ADDRS
    • IP addresses of the Hibari nodes within Network B. Network B is used only for heartbeat broadcasts that help to detect network partitions. The list of IP addresses should correspond, in order, to the host names you listed in the ALL_NODES setting.
  • ALL_NETA_BCAST
    • IP broadcast address for Network A.
  • ALL_NETB_BCAST
    • IP broadcast address for Network B.
  • ALL_NETA_TIEBREAKER
    • Within Network A, the IP address for the network monitoring application to use as a “tiebreaker” in the event of a partition. If the network monitoring application on a Hibari node determines that Network A is partitioned while Network B is not, and the Network A tiebreaker IP address responds to a ping, then the local node is on the “correct” side of the partition. Ideally the tiebreaker should be the address of the Layer 2 switch or Layer 3 router that all Erlang network distribution communications flow through.
  • ALL_HEART_UDP_PORT
    • UDP port for heartbeat listener.
  • ALL_HEART_XMIT_UDP_PORT
    • UDP port for heartbeat transmitter.

For more detail on network monitoring configuration settings, see the partition-detector’s OTP application source file (https://github.com/hibari/partition-detector/raw/master/src/partition_detector.app.src).

CAUTION: In a production setting, Network A and Network B should be physically different networks and network interfaces. However, for testing and development purposes the same physical network can be used for Network A and Network B (as in the sample configuration file above).

As final configuration steps, on each Hibari node:

  • Make sure that the /etc/hosts file has entries for all Hibari nodes in the cluster. For example:

    10.181.165.230  dev1.your-domain.com    dev1
    10.181.165.231  dev2.your-domain.com    dev2
    10.181.165.232  dev3.your-domain.com    dev3
    
  • In the system’s /etc/sysctl.conf file, set vm.swappiness=1. Swappiness is not desirable for an Erlang VM.

Installing Hibari

From your installer node, logged in as the installer user, take these steps to create your Hibari cluster:

  1. In the working directory in which you link:#download-cluster[downloaded the Cluster tool] and link:#config-cluster[created your cluster configuration file], place a copy of the Hibari tarball package and md5sum file:

    $ cd working-directory
    $ ls -1
    clus
    hibari-X.Y.Z-DIST-ARCH-WORDSIZE-md5sum.txt
    hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
    hibari.config
    $
    
  2. Create the “hibari” user on all Hibari nodes:

    $ for i in dev1 dev2 dev3 ; do ./clus/priv/clus.sh -f init hibari $i ; done
    hibari@dev1
    hibari@dev2
    hibari@dev3
    

Note

If the “hibari” user already exists on the target nodes, the -f option will forcefully delete and then re-create the “hibari” user.

  3. Install the Hibari package on all Hibari nodes, via the newly created “hibari” user:

    $ ./clus/priv/clus-hibari.sh -f init hibari hibari.config hibari-X.Y.Z-DIST-ARCH-WORDSIZE.tgz
    hibari@dev1
    hibari@dev2
    hibari@dev3
    

Note

By default the Cluster tool installs Hibari into /usr/local/var/lib on the target nodes. If you prefer a different location, before doing the install open the clus.sh script (in your working directory, under /clus/priv/) and edit the CT_HOMEBASEDIR variable.

[[starting-multi-node]]

Starting and Stopping a Multi-Node Hibari Cluster

You can use the Cluster installer tool to start and stop your multi-node Hibari cluster, working from the same node from which you managed the installation process. Note that in each of the Hibari commands in this section you’ll be referencing the name of the link:#config-cluster[Cluster tool configuration file] that you created during the installation procedure.

Starting and Bootstrapping the Hibari Cluster
  1. Change to the working directory in which you downloaded the Cluster tool, then start Hibari on all Hibari nodes via the “hibari” user:

    $ cd working-directory
    $ ./clus/priv/clus-hibari.sh -f start hibari hibari.config
    hibari@dev1
    hibari@dev2
    hibari@dev3
    
  2. If this is the first time you’ve started Hibari, bootstrap the system via the “hibari” user:

    $ ./clus/priv/clus-hibari.sh -f bootstrap hibari hibari.config
    hibari@dev1 => hibari@dev1 hibari@dev2 hibari@dev3
    

The Hibari bootstrap process starts Hibari’s Admin Server on the first link:#eligible-admin-nodes[eligible admin node] and creates a single table “tab1” serving as Hibari’s default table. For information about creating additional tables, see link:#creating-tables[Creating New Tables].

Note

If bootstrapping fails with an “another_admin_server_running” error, please stop the other Hibari cluster(s) running on the network, or reconfigure the Cluster tool to assign link:#eligible-admin-nodes[Hibari heartbeat listener ports] that are not in use by another Hibari cluster or other applications, and then repeat the cluster installation procedure.

Verifying the Hibari Cluster

Do these simple checks to verify that Hibari is up and running.

  1. Confirm that you can open the “Hibari Web Administration” page:

    $ your-favorite-browser http://dev1:23080
    
  2. Confirm that you can successfully ping each of your Hibari nodes:

    $ ./clus/priv/clus-hibari.sh -f ping hibari hibari.config
    hibari@dev1 ... pong
    hibari@dev2 ... pong
    hibari@dev3 ... pong
    
Stopping the Hibari Cluster

Stop Hibari on all Hibari nodes via the “hibari” user:

$ cd working-directory
$ ./clus/priv/clus-hibari.sh -f stop hibari hibari.config
ok
ok
ok
hibari@dev1
hibari@dev2
hibari@dev3

[[creating-tables]]

Creating New Tables

The simplest way to create a new table is via the Admin Server’s GUI. Open http://localhost:23080/ and click the “Add a table” link. In addition to the GUI, the hibari-admin tool can also be used to create a new table. See the hibari-admin tool for usage details.

Note

For information about creating tables using the administrative API, see the Hibari System Administrator’s Guide.

When adding a table through the GUI, you have these table configuration options:

  • Local
    • Boolean. If true, all bricks for storing the new table’s data will be created on the local node, i.e. the node that’s running the Admin Server. If false, then the “NodeList” field is used to specify which cluster nodes the new bricks should use.
  • BigData
    • Boolean. If true, value blobs will be stored on disk.
  • DiskLogging
    • Boolean. If true, all updates will be written to the write-ahead log for persistence. If false, bricks will run faster but at the expense of data loss in a cluster-wide power failure.
  • SyncWrites
    • Boolean. If true, all writes to the write-ahead log will be flushed to stable storage via the fsync(2) system call. If false, bricks will run faster but at the expense of data loss in a cluster-wide power failure.
  • VarPrefix
    • Boolean. If true, then a variable-length prefix of the key will be used as input for the consistent hashing function. If false, the entire key will be used.

Many applications can benefit from using a variable-length or fixed-length prefix hashing scheme. As an example, consider an application that maintains state for various users. The app wishes to use micro-transactions to update various keys (in the same table) related to a given user. The table can be created with VarPrefix=true, together with VarPrefixSeparator=47 (ASCII 47 is the forward slash character) and VarPrefixNumSeparators=2, to create a hashing scheme that guarantees that the keys /FooUser/summary, /FooUser/thing1, and /FooUser/thing9 are all stored by the same chain.

Note

The HTTP interface for creating tables does not expose the fixed-length key prefix scheme. The Erlang API must be used in this case.

  • VarPrefixSeparator
    • Integer. Define the character used for variable-length key prefix calculation. Note that the default value of ASCII 47 (the “/” character), or any other character, does not imply any UNIX/POSIX style file or directory semantics.
  • VarPrefixNumSeparators
    • Integer. Define the number of VarPrefixSeparator bytes, and all bytes in between, used for consistent hashing. If VarPrefixSeparator=47 and VarPrefixNumSeparators=3, then for a key such as /foo/bar/baz, the prefix used for consistent hashing will be /foo/bar/.
  • Bricks
    • Integer. If Local=true (see above), then this integer defines the total number of logical bricks that will be created on the local node. This value is ignored if Local=false.
  • BPC
    • Integer. Define the number of bricks per chain.

The algorithm used for creating chain -> brick mapping is based on a “striping” principle: enough chains are laid across bricks in a stripe-wise manner so that all nodes (aka physical bricks) will have the same number of logical bricks in head, middle, and tail roles. See the example in the Hibari System Administrator’s Guide of link:hibari-sysadmin-guide.en.html#3-chains-striped-across-3-bricks[3 chains striped across three nodes].

The Erlang API must be used to create tables with other chain layout patterns.

  • NodeList
    • Comma-separated string. If Local=false, specify the list of nodes that will run logical bricks for the new table. Each node in the comma-separated list should take the form NodeName@HostName. For example, use hibari1@machine-a, hibari1@machine-b, hibari1@machine-c to specify three nodes.
  • NumNodesPerBlock
    • Integer. If Local=false, then this integer will affect the striping behavior of the default chain striping algorithm. This value must be zero (i.e. this parameter is ignored) or a multiple of the BPC parameter.

For example, if NodeList contains nodes A, B, C, D, E, and F, then the following striping patterns would be used:

  • NumNodesPerBlock=0 would stripe across all 6 nodes for 6 chains total.
  • NumNodesPerBlock=2 and BPC=2 would stripe 2 chains across nodes A & B, 2 chains across C & D, and 2 chains across E & F.
  • NumNodesPerBlock=3 and BPC=3 would stripe 3 chains across nodes A & B & C and 3 chains across D & E & F.
  • BlockMultFactor
    • Integer. If Local=false, then this integer will affect the striping behavior of the default chain striping algorithm. This value must be zero (i.e. this parameter is ignored) or greater than zero.

For example, if NodeList contains nodes A, B, C, D, E, and F, then the following striping patterns would be used:

  • NumNodesPerBlock=0 and BlockMultFactor=0 would stripe across all 6 nodes for 6 chains total.
  • NumNodesPerBlock=2 and BlockMultFactor=5 and BPC=2 would stripe 2*5=10 chains across nodes A & B, 2*5=10 chains across C & D, and 2*5=10 chains across E & F, for a total of 30 chains.
  • NumNodesPerBlock=3 and BlockMultFactor=4 and BPC=3 would stripe 3*4=12 chains across nodes A & B & C and 3*4=12 chains across D & E & F, for a total of 24 chains.

The Hibari Data Model

If a Hibari table were represented within an SQL database, it would look something like this:

[[sql-definition-hibari]]

include::texts-src/hibari-sql-definition.txt[]

Hibari table names use the Erlang data type “atom”. The types of all key-related attributes are presented below.

include::texts-src/hibari-key-value-attrs.txt[]

include::texts-src/hibari-key-value-attrs-expl.txt[]

The practical constraints on maximum value blob size are affected by total blob size and frequency of large blob access. For example, storing an occasional 64MB value blob is different from a 100% write workload of 100% 64MB value blobs. The Hibari client API does not have a method to update or fetch less than the entire value blob, so a brick can be blocked for many seconds if it tries to operate on (for example) even a single 4GB blob. In addition, other processes can be blocked by ‘busy_dist_port’ events while processing big value blobs.

Hibari Client API Overview

As a key-value database, Hibari provides a simple client API with primitive operations for inserting, retrieving, and deleting data. Within certain restrictions, the API also supports compound operations that optionally can be executed as atomic transactions.

Supported Operations

Hibari’s client API supports the operations listed below.

Data Insertion
brick_simple:add(Table, Key, Value[, ExpTime][, Flags][, Timeout])

Adds a key-value pair that does not yet exist, along with optional flags.

Successful adding of a new key-value pair:

> brick_simple:add(tab1, <<"foo">>, <<"Hello, world!">>).
{ok,1271542959131192}

Failed attempt to add a key that already exists:

> brick_simple:add(tab1, <<"foo">>, <<"Goodbye, world!">>).
{key_exists,1271542959131192}
brick_simple:replace(Table, Key, Value[, ExpTime][, Flags][, Timeout])

Assigns a new value and/or new flags to a key that already exists.

brick_simple:set(Table, Key, Value[, ExpTime][, Flags][, Timeout])

Sets a key-value pair and optional flags regardless of whether the key yet exists.

brick_simple:rename(Table, Key, NewKey[, ExpTime][, Flags][, Timeout])

Renames a key that already exists.

Successful renaming of a key-value pair:

> brick_simple:rename(tab1, <<"my/foo">>, <<"my/bar">>).
{ok,1271543165272987}

The rename operation fails if Key and NewKey do not share a common key prefix:

> brick_simple:rename(tab1, <<"my/foo">>, <<"her/foo">>).
...

See link:#creating-tables[Creating New Tables] (the VarPrefix option) for more details.

Data Retrieval
  • Retrieve a key and optionally its associated value and flags:
    • link:#brick-simple-get[brick_simple:get/4]
  • Retrieve multiple lexicographically contiguous keys and optionally their associated values and flags:
    • link:#brick-simple-get-many[brick_simple:get_many/5]
Data Deletion
  • Delete a key-value pair and associated flags:
    • link:#brick-simple-delete[brick_simple:delete/4]
Compound Operations
  • Execute a specified list of operations, optionally as an atomic transaction (micro-transaction):
    • link:#brick-simple-do[brick_simple:do/4]
Fold Operations
  • Implement a fold operation across all keys in a table:
    • link:#brick-simple-fold-table[brick_simple:fold_table/7]
  • Implement a fold operation across all keys having a specified prefix:
    • link:#brick-simple-fold-key[brick_simple:fold_key_prefix/9]

Note

Fold operations are performed on the client side, not the server side.

Check and Swap (CAS)

If desired, clients can apply a “check and swap” (or “test and set”) logic to data insertion, retrieval, and deletion operations so that the operation will be executed only if the target key has the exact timestamp specified in the request.
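
For example, a minimal sketch of the idiom (the timestamps shown are illustrative): read a key, then replace it only if it still carries the timestamp that was read:

> {ok, Ts, _Val} = brick_simple:get(tab1, <<"foo">>).
{ok,1271543165272987,<<"Goodbye, world!">>}
> brick_simple:replace(tab1, <<"foo">>, <<"Updated value">>, [{'testset', Ts}]).
{ok,1271543165272988}

If another client modified the key in the meantime, the timestamp no longer matches and the replace returns {'ts_error', timestamp()} instead of overwriting the newer value.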

Micro-Transaction

TODO

Client API: Native Erlang

Data Insertion

  • Add a key-value pair that does not yet exist, along with optional flags:

    • link:#brick-simple-add[brick_simple:add/6]
  • Assign a new value and/or new flags to a key that already exists:

    • link:#brick-simple-replace[brick_simple:replace/6]
  • Rename a key that already exists:

    • link:#brick-simple-rename[brick_simple:rename/6]
  • Set a key-value pair and optional flags regardless of whether the key yet exists:

    • link:#brick-simple-set[brick_simple:set/6]

Data Retrieval

  • Retrieve a key and optionally its associated value and flags:
    • link:#brick-simple-get[brick_simple:get/4]
  • Retrieve multiple lexicographically contiguous keys and optionally their associated values and flags:
    • link:#brick-simple-get-many[brick_simple:get_many/5]

Data Deletion

  • Delete a key-value pair and associated flags:
    • link:#brick-simple-delete[brick_simple:delete/4]

Compound Operations

  • Execute a specified list of operations, optionally as an atomic transaction (micro-transaction):
    • link:#brick-simple-do[brick_simple:do/4]

If desired, clients can apply a “test-and-set” logic to data insertion, retrieval, and deletion operations so that the operation will be executed only if the target key has the exact timestamp specified in the request.

Fold Operations

  • Implement a fold operation across all keys in a table:
    • link:#brick-simple-fold-table[brick_simple:fold_table/7]
  • Implement a fold operation across all keys having a specified prefix:
    • link:#brick-simple-fold-key[brick_simple:fold_key_prefix/9]

Note

Fold operations are performed on the client side, not the server side.

brick_simple:add/6

Adds Key and Value pair (and optional Flags) to the table Table if the key does not already exist. The operation will fail if Key already exists.

brick_simple:add(Table, Key, Value)
brick_simple:add(Table, Key, Value, Flags)
brick_simple:add(Table, Key, Value, Timeout)
brick_simple:add(Table, Key, Value, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table to which to add the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to add to the table, in association with a paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of Value.

Parameters:
  • Value (val()) –

    Value to associate with the key

    • -type val() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the add operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: 'value_in_ram'
      • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested

    • -type property() :: atom() | {term(), term()}
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:{'key_exists', timestamp()}
  • The operation failed because the key already exists.
  • -type timestamp() :: integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful adding of a new key-value pair:

> brick_simple:add(tab1, <<"foo">>, <<"Hello, world!">>).
{ok,1271542959131192}

Failed attempt to add a key that already exists:

> brick_simple:add(tab1, <<"foo">>, <<"Goodbye, world!">>).
{key_exists,1271542959131192}

Successful adding of a new key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:add(tab1, "foo1", "this is value1", ['value_in_ram']).
{ok,1271542959131192}

Successful adding of a new key-value pair, using a non-default operation timeout:

> brick_simple:add(tab1, "foo2", "this is value2", 20000).
{ok,1271542959131192}
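
Successful adding of a new key-value pair with a custom property attached (a sketch: the content_type property name is hypothetical and the timestamp shown is illustrative). Recall that heavy use of custom properties is discouraged because they are stored in RAM:

> brick_simple:add(tab1, "foo5", "<html></html>", [{content_type, "text/html"}]).
{ok,1271542959131195}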

brick_simple:replace/6

Replace the Key and Value pair (and optional Flags) in the table Table if the key already exists. The operation will fail if Key does not already exist.

brick_simple:replace(Table, Key, Value)
brick_simple:replace(Table, Key, Value, Flags)
brick_simple:replace(Table, Key, Value, Timeout)
brick_simple:replace(Table, Key, Value, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to replace the key-value pair.

    • -type table() :: atom()
  • Key

    Key to replace in the table, in association with a new paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of Value.

Parameters:
  • Value (val()) –

    Value to associate with the key

    • -type val() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the replace operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'value_in_ram' | {'exp_time_directive', 'keep' | 'replace'} | {'attrib_directive', 'keep' | 'replace'}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp()
      • {'exp_time_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the ExpTime is kept from the old key value pair or replaced with the ExpTime provided in the replace operation
      • {'attrib_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the custom properties are kept from the old key value pair or replaced with the custom properties provided in the replace operation
        • If kept, the custom properties remain unchanged. If you specify custom properties explicitly in the replace operation, Hibari adds them to the resulting key value pair
        • If replaced, all original custom properties are deleted, and then Hibari adds the custom properties in the replace operation to the resulting key value pair
      • 'value_in_ram'
        • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested

  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:'key_not_exist'
  • The operation failed because the key does not exist
  • -type timestamp() :: integer()
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • timestamp() = integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful replacement of a key-value pair:

> brick_simple:replace(tab1, <<"foo">>, <<"Goodbye, world!">>).
{ok,1271543165272987}

Failed attempt to replace a key that does not yet exist:

> brick_simple:replace(tab1, <<"key3">>, <<"new and improved value">>).
key_not_exist

Successful replacement of a key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:replace(tab1, "foo", "You again, world!", ['value_in_ram']).
{ok,1271543165272987}

Failed attempt to replace a key for which we have incorrectly specified its current timestamp:

> brick_simple:replace(tab1, "foo", "Whole new value", [{'testset', 12345}]).
{ts_error,1271543165272987}

Successful replacement of a key-value pair for which we have correctly specified its current timestamp:

> brick_simple:replace(tab1, "foo", "Whole new value", [{'testset', 1271543165272987}]).
{ok,1271543165272988}

Successful replacement of a key-value pair, using a non-default operation timeout:

> brick_simple:replace(tab1, "foo", "Foo again?", 30000).
{ok,1271543165272989}
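
A sketch of the 'exp_time_directive' and 'attrib_directive' flags (the timestamp shown is illustrative): replace only the value while keeping the key’s existing expiration time and custom properties, instead of the default behavior of replacing them:

> brick_simple:replace(tab1, "foo", "Same key, new value",
                       [{'exp_time_directive', 'keep'}, {'attrib_directive', 'keep'}]).
{ok,1271543165272990}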

brick_simple:set/6

Set Key and Value pair (and optional Flags) in the table Table, regardless of whether or not the key already exists.

brick_simple:set(Table, Key, Value)
brick_simple:set(Table, Key, Value, Flags)
brick_simple:set(Table, Key, Value, Timeout)
brick_simple:set(Table, Key, Value, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table to which to set the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to set in the table, in association with a paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of Value.

Parameters:
  • Value

    Value to associate with the key

    • -type val() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the set operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'value_in_ram' | {'exp_time_directive', 'keep' | 'replace'} | {'attrib_directive', 'keep' | 'replace'}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp(). Using this flag with set will result in an error if the key does not already exist or if the key exists but has a non-matching timestamp.
      • {'exp_time_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the ExpTime is kept from the old key value pair or replaced with the ExpTime provided in the set operation
      • {'attrib_directive', 'keep' | 'replace'}
        • Defaults to 'replace'
        • Specifies whether the custom properties are kept from the old key value pair or replaced with the custom properties provided in the set operation
        • If kept, the custom properties remain unchanged. If you specify custom properties explicitly in the set operation, Hibari adds them to the resulting key value pair
        • If replaced, all original custom properties are deleted, and then Hibari adds the custom properties in the set operation to the resulting key value pair
      • 'value_in_ram'
        • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested

  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:'key_not_exist'
  • The operation failed because the {'testset', timestamp()} flag was used and the key does not exist
  • -type timestamp() :: integer()
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • timestamp() = integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful setting of a key-value pair:

> brick_simple:set(tab1, <<"key4">>, <<"cool value">>).
{ok,1271542959131192}

Successful setting of a key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:set(tab1, "goo", "value6", ['value_in_ram']).
{ok,1271542959131193}

Failed attempt to set a key-value pair, when we have used the testset flag but the key does not yet exist:

> brick_simple:set(tab1, "boo", "hoo", [{'testset', 1271543165272987}]).
key_not_exist

Successful setting of a key-value pair, when we have used the testset flag and the key does already exist and its timestamp matches our specified timestamp:

> brick_simple:set(tab1, "goo", "value7", [{'testset', 1271543165272432}]).
{ok,1271543165272433}
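
Successful setting of a key-value pair with an explicit expiration time, using the six-argument form (a sketch: the Unix time_t() value and the returned timestamp are illustrative):

> brick_simple:set(tab1, "session", "token-data", 1430000000, [], 15000).
{ok,1271543165272991}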

brick_simple:rename/6

Rename the key Key (along with its value pair and flags) to NewKey in the table Table if the key already exists. The operation will fail if:

  • Key does not already exist
  • ... or Key and NewKey do not share a common key prefix (see link:#creating-tables[Creating New Tables], the VarPrefix option, for more details)
brick_simple:rename(Table, Key, NewKey)
brick_simple:rename(Table, Key, NewKey, Flags)
brick_simple:rename(Table, Key, NewKey, Timeout)
brick_simple:rename(Table, Key, NewKey, ExpTime, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to rename the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to rename in the table, in association with a paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution. The same is true of NewKey

Parameters:
  • NewKey

    New key to use in the table, in association with the existing paired value

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]
  • ExpTime (exp_time()) –
    • Time at which the key will expire, expressed as a Unix time_t().
    • Optional; defaults to 0 (no expiration).
    • -type exp_time() :: time_t()
    • -type time_t() :: integer()
  • Flags (flags_list()) –
    • List of operational flags to apply to the rename operation, and/or custom property flags to associate with the key-value pair in the database. Heavy use of custom property flags is discouraged due to RAM-based storage
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'value_in_ram' | {'exp_time_directive', 'keep' | 'replace'} | {'attrib_directive', 'keep' | 'replace'}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp().
      • {'exp_time_directive', 'keep' | 'replace'}
        • Defaults to 'keep'
        • Specifies whether the ExpTime is kept from the old key value pair or replaced with the ExpTime provided in the rename operation
      • {'attrib_directive', 'keep' | 'replace'}
        • Defaults to 'keep'
        • Specifies whether the custom properties are kept from the old key value pair or replaced with the custom properties provided in the rename operation
        • If kept, the custom properties remain unchanged. If you specify custom properties explicitly in the rename operation, Hibari adds them to the resulting key value pair
        • If replaced, all original custom properties are deleted, and then Hibari adds the custom properties in the rename operation to the resulting key value pair
      • 'value_in_ram'
        • Store the value blob in RAM, overriding the default storage location of the brick

        Note

        The 'value_in_ram' flag has not been extensively tested

  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{'ok', timestamp()}

Error returns

Return type:'key_not_exist'
  • The operation failed because Key does not exist, or because Key and NewKey are equal
  • -type timestamp() :: integer()
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • timestamp() = integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key and the new key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful renaming of a key-value pair:

> brick_simple:rename(tab1, <<"foo">>, <<"bar">>).
{ok,1271543165272987}

Failed attempt to rename a key that does not yet exist:

> brick_simple:rename(tab1, <<"key3">>, <<"bar">>).
key_not_exist

Successful renaming of a key-value pair, with value to be stored in RAM regardless of brick’s default storage setting:

> brick_simple:rename(tab1, "foo", "bar", ['value_in_ram']).
{ok,1271543165272987}

Failed attempt to rename a key for which we have incorrectly specified its current timestamp:

> brick_simple:rename(tab1, "foo", "bar", [{'testset', 12345}]).
{ts_error,1271543165272987}

Successful renaming of a key-value pair for which we have correctly specified its current timestamp:

> brick_simple:rename(tab1, "foo", "bar", [{'testset', 1271543165272987}]).
{ok,1271543165272988}

Successful renaming of a key-value pair, using a non-default operation timeout:

> brick_simple:rename(tab1, "foo", "bar", 30000).
{ok,1271543165272989}
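
Successful renaming of a key-value pair while discarding the old expiration time, using the six-argument form (a sketch: the returned timestamp is illustrative). Note that rename defaults to 'keep' for both directives, so 'replace' must be requested explicitly:

> brick_simple:rename(tab1, "foo", "bar", 0, [{'exp_time_directive', 'replace'}], 15000).
{ok,1271543165272992}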

brick_simple:get/4

From table Table, retrieve Key and specified attributes of the key (as determined by Flags).

brick_simple:get(Table, Key)
brick_simple:get(Table, Key, Flags)
brick_simple:get(Table, Key, Timeout)
brick_simple:get(Table, Key, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table from which to retrieve the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to retrieve from the table

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution

Parameters:
  • Flags (flags_list()) –
    • List of operational flags to apply to the get operation.
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag()]
    • -type do_op_flag() :: 'get_all_attribs' | 'witness' | {'testset', timestamp()} | 'must_exist' | 'must_not_exist'
    • -type timestamp() :: integer()
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of the key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blob in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'testset', timestamp()}
        • Fail the operation if the key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp().
        • This flag has priority over the 'must_exist' and 'must_not_exist' flags
      • 'must_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key does not exist
      • 'must_not_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key exists. This flag may be useful when the relationship between two or more keys is important to the client application
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success returns

Return type:{'ok', timestamp(), val()}
  • Success return when the get request uses neither the 'witness' flag nor the 'get_all_attribs' flag
  • -type timestamp() :: integer()
  • -type val() :: iodata()
  • -type iodata() :: iolist() | binary()
  • -type iolist()  :: [char() | binary() | iolist()]
Return type:{'ok', timestamp()}
  • Success return when the get uses 'witness' but not 'get_all_attribs'
Return type:{'ok', timestamp(), exp_time(), proplist()}
  • Success return when the get uses both 'witness' and 'get_all_attribs'
  • -type exp_time() :: time_t()
  • -type proplist() :: [property()]
  • -type property() :: atom() | {term(), term()}
Return type:{'ok', timestamp(), val(), exp_time(), proplist()}
  • Success return when the get uses 'get_all_attribs' but not 'witness'
  • -type exp_time() :: time_t()

Note

When a proplist() is returned, one of the properties in the list will always be {val_len, Size::integer()}, where Size is the size of the value blob in bytes

Error returns

Return type:'key_not_exist'
  • The operation failed because the key does not exist.
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful retrieval of a key-value pair:

> brick_simple:get(tab1, "goo").
{ok,1271543165272432,<<"value7">>}

Successful retrieval of a key without its associated value blob:

> brick_simple:get(tab1, "goo", ['witness']).
{ok,1271543165272432}

Failed attempt to retrieve a key that does not exist:

> brick_simple:get(tab1, "moo").
key_not_exist
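
Successful retrieval of a key with all of its attributes (a sketch mirroring the documented {'ok', timestamp(), val(), exp_time(), proplist()} return; the values shown are illustrative):

> brick_simple:get(tab1, "goo", ['get_all_attribs']).
{ok,1271543165272432,<<"value7">>,0,[{val_len,6}]}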

brick_simple:get_many/5

Get many keys from a single chain in the table Table, up to a maximum of MaxNum keys. Keys are returned in lexicographic sorting order starting with the first key _after_ the key specified by the Key argument. The return list includes a boolean value indicating whether or not there are more keys after the last key of the return results.

Important

A single get_many() function call cannot be used to retrieve keys from across multiple storage chains. The consistent hash of Key will send the get_many operation to the tail brick in a single chain; all keys returned will come from that single brick only.

brick_simple:get_many(Table, Key, MaxNum)
brick_simple:get_many(Table, Key, MaxNum, Flags)
brick_simple:get_many(Table, Key, MaxNum, Timeout)
brick_simple:get_many(Table, Key, MaxNum, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table from which to retrieve the key-value pairs

    • -type table() :: atom()
  • Key (key()) –

    Key after which to start the get_many retrieval, proceeding in lexicographic order with the first key after the specified Key

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution

Parameters:
  • MaxNum (integer()) – Maximum number of keys to return
  • Flags
    • List of operational flags to apply to the get_many operation.
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag()]
    • -type do_op_flag() :: 'get_all_attribs' | 'witness' | {'binary_prefix', binary()} | {'max_bytes', integer()} | {'max_num', integer()}
    • -type timestamp() :: integer()
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of each key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blobs in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'binary_prefix', binary()}
        • Return only keys that have a binary prefix that is exactly equal to binary()
      • {'max_bytes', integer()}
        • Return only as many keys as the sum of the sizes of their corresponding value blobs does not exceed integer() bytes
      • {'max_num', integer()}
        • Maximum number of keys to return
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success returns

Return type:

{ok, {[{key(), timestamp(), val()}], boolean()}}

  • Success return when the get_many request uses neither the 'witness' flag nor the 'get_all_attribs' flag
  • -type timestamp() :: integer()
  • -type val() :: iodata()
  • -type iodata() :: iolist() | binary()
  • iolist() :: [char() | binary() | iolist()]

Return type:

{ok, {[{key(), timestamp()}], boolean()}}

  • Success return when the get_many uses 'witness' but not 'get_all_attribs'

Return type:

{ok, {[{key(), timestamp(), exp_time(), proplist()}], boolean()}}

  • Success return when the get_many uses both 'witness' and 'get_all_attribs'
  • -type exp_time() :: time_t()
  • -type proplist() :: [property()]
  • property() :: atom() | {term(), term()}

Return type:

{ok, {[{key(), timestamp(), val(), exp_time(), proplist()}], boolean()}}

  • Success return when the get_many uses 'get_all_attribs' but not 'witness'
  • exp_time() :: time_t()

Note

The boolean at the end of the success return indicates whether or not the chain has more keys lexicographically after the last key in the return (true for yes, false for no). When a proplist() is returned, one of the properties in the list will always be {val_len, Size::integer()}, where Size is the size of the value blob in bytes.

Error returns

Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful retrieval of all keys from a table that currently has only two keys. The boolean false indicates that there are no keys following the “foo” key:

> brick_simple:get_many(tab1, "", 5).
{ok,{[{<<"another">>,1271543102911775,<<"yes!">>},
      {<<"foo">>,1271543165272987,<<"Foo again?">>}],
     false}}

Successful retrieval of all keys from a table that currently has only two keys, using the witness flag in the request:

> brick_simple:get_many(tab1, "", 5, ['witness']).
{ok,{[{<<"another">>,1271543102911775},
      {<<"foo">>,1271543165272987}],
     false}}

Successful retrieval of all keys from a table that currently has only two keys, using the get_all_attribs flag in the request:

> brick_simple:get_many(tab1, "", 5, ['get_all_attribs']).
{ok,{[{<<"another">>,1271543102911775,<<"yes!">>,0,[{val_len,4}]},
      {<<"foo">>,1271543165272987,<<"Foo again?">>,0,[{val_len,6}]}],
     false}}
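
Because get_many/5 returns at most MaxNum keys per call, retrieving a whole key range requires repeated calls, each restarting from the last key returned. Below is a minimal client-side sketch (the function and prefix names are illustrative, not part of the API) that pages through all keys sharing a binary prefix in witness mode, assuming the table’s hashing scheme places all keys with that prefix on the same chain (e.g. VarPrefix=true). brick_simple:fold_key_prefix/9 offers a ready-made fold over a key prefix as well:

get_all_keys(Table, Prefix) ->
    get_all_keys(Table, Prefix, Prefix, []).

get_all_keys(Table, Prefix, StartKey, Acc) ->
    %% In witness mode each result tuple is {Key, Timestamp}.
    {ok, {Results, MoreKeys}} =
        brick_simple:get_many(Table, StartKey, 100,
                              ['witness', {'binary_prefix', Prefix}]),
    NewAcc = Acc ++ Results,
    case MoreKeys of
        true ->
            %% Restart the next batch from the last key seen.
            {LastKey, _Ts} = lists:last(Results),
            get_all_keys(Table, Prefix, LastKey, NewAcc);
        false ->
            NewAcc
    end.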

brick_simple:delete/4

Delete key Key from the table Table. The operation will fail if Key does not already exist.

brick_simple:delete(Table, Key)
brick_simple:delete(Table, Key, Flags)
brick_simple:delete(Table, Key, Timeout)
brick_simple:delete(Table, Key, Flags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table from which to delete the key-value pair

    • -type table() :: atom()
  • Key (key()) –

    Key to delete from the table

    • -type key() :: iodata()
    • -type iodata() :: iolist() | binary()
    • -type iolist() :: [char() | binary() | iolist()]

Note

While the Key may be specified as either iolist() or binary(), it will be converted into binary before operation execution

Parameters:
  • Flags (flags_list()) –
    • List of operational flags to apply to the delete operation.
    • Optional; defaults to empty list
    • -type flags_list() :: [do_op_flag()]
    • -type do_op_flag() :: {'testset', timestamp()} | 'must_exist' | 'must_not_exist'
    • -type timestamp() :: integer()
    • Operational flag usage
      • {'testset', timestamp()}
        • Fail the operation if the existing key’s timestamp is not exactly equal to timestamp(). If used inside a link:#brick-simple-do[micro-transaction], abort the transaction if the key’s timestamp is not exactly equal to timestamp(). This flag has priority over the 'must_exist' and 'must_not_exist' flags
      • 'must_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key does not exist
      • 'must_not_exist'
        • For use inside a link:#brick-simple-do[micro-transaction]: abort the transaction if the key exists. This flag may be useful when the relationship between two or more keys is important to the client application
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:'ok'

Error returns

Return type:'key_not_exist'
  • The operation failed because the key does not exist
Return type:{'ts_error', timestamp()}
  • The operation failed because the {'testset', timestamp()} flag was used and there was a timestamp mismatch. The timestamp() in the return is the current value of the existing key’s timestamp.
  • timestamp() = integer()
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_op_flag() was found in the Flags argument.
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable.
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain.
  • -type node() :: atom()
Examples

Successful deletion of a key and its associated value and attributes:

> brick_simple:delete(tab1, <<"foo">>).
ok

Failed attempt to delete a key that does not exist:

> brick_simple:delete(tab1, "key6").
key_not_exist

Failed attempt to delete a key for which we have incorrectly specified its current timestamp:

> brick_simple:delete(tab1, "goo", [{'testset', 12345}]).
{ts_error,1271543165272987}

Successful deletion of a key for which we have correctly specified its current timestamp:

> brick_simple:delete(tab1, "goo", [{'testset', 1271543165272987}]).
ok

Successful deletion of a key, using a non-default operation timeout:

> brick_simple:delete(tab1, "key3", 30000).
ok

brick_simple:do/4

Send a list of primitive operations to the table Table. They will be executed at the same time by a Hibari brick. If the first item in the OpList is brick_server:make_txn() then the list of operations is executed in the context of a micro-transaction: either all operations will be executed successfully or none will be executed.

We term these “micro”-transactions because they are subject to certain limitations that apply to all operations that use the brick_simple:do() API:

  • All impacted keys must be in the same table.
  • All impacted keys must be in the same chain.
  • All operations in the transaction must be sent in a single brick_simple:do() call. Unlike some other databases, it is not possible to request a transaction handle and add operations to that transaction in a one-by-one, “ad hoc” manner.

For further information about micro-transactions, see link:hibari-sysadmin-guide.en.html#micro-transactions[Hibari System Administrator’s Guide, “Micro-Transactions” section].

brick_simple:do(Table, OpList)
brick_simple:do(Table, OpList, Timeout)
brick_simple:do(Table, OpList, OpFlags, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to perform the operations

    • -type table() :: atom()
  • OpList (do_op_list()) –
    • List of primitive operations to perform. Each primitive is invoked using the brick_server:make_*() API
    • -type do_op_list() :: [do1_op()]
    • -type do1_op() ::
      • brick_server:make_add(Key, Value, ExpTime, Flags)
      • brick_server:make_replace(Key, Value, ExpTime, Flags)
      • brick_server:make_set(Key, Value, ExpTime, Flags)
      • brick_server:make_rename(Key, NewKey, ExpTime, Flags)
      • brick_server:make_get(Key, Flags)
      • brick_server:make_get_many(Key, Flags)
      • brick_server:make_delete(Key, Flags)
      • brick_server:make_txn()
        • Include brick_server:make_txn() as the first item in your OpList if you want the do operation to be executed as an atomic transaction
        • Note that the arguments for each primitive are the same as those for the primitives when they are executed on their own, with the exclusion of the Tab and Timeout arguments, both of which serve as arguments to the overall do operation rather than as arguments to the primitives. For example, an add on its own is brick_simple:add(Tab, Key, Value, ExpTime, Flags, Timeout), whereas in the context of a do operation an add primitive is brick_server:make_add(Key, Value, ExpTime, Flags)
        • For further information about each primitive, see link:#brick-simple-add[brick_simple:add/6], link:#brick-simple-replace[brick_simple:replace/6], link:#brick-simple-set[brick_simple:set/6], link:#brick-simple-rename[brick_simple:rename/6], link:#brick-simple-get[brick_simple:get/4], link:#brick-simple-get-many[brick_simple:get_many/5], and link:#brick-simple-delete[brick_simple:delete/4]
  • OpFlags (do_flags_list()) –
    • List of operational flags to apply to the overall do operation.
    • Optional; defaults to empty list
    • -type do_flags_list() :: [do_flag()]
    • -type do_flag() :: 'fail_if_wrong_role' | 'ignore_role'
    • Operational flag usage
      • 'fail_if_wrong_role'
        • If the ‘do’ operation is sent to the wrong brick in the target chain (e.g. a ‘read’ request mistakenly sent to the ‘head’ brick or a ‘write’ request mistakenly sent to the ‘tail’ brick), fail the transaction immediately. If this flag is not used, the default behavior is for the incorrect brick to forward the request to the correct brick
      • 'ignore_role'
        • If this flag is used, then whichever brick receives the request will reply to the request directly, regardless of the brick’s assigned role
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:[do1_res_ok]
  • List of do1_res_ok, one for each primitive operation specified in the do request. Return list order corresponds to the order in which primitive operations are listed in the request’s OpList. Note that if the do request does not use transaction semantics, then some individual primitive operations may fail without the overall do operation failing
  • Within the return list, possible do1_res_ok returns to each individual primitive operation are the same as the possible returns that the primitive operation type could generate if it were executed on its own. For example, within the do operation’s success return list, the possible returns for a primitive add operation are the same as the returns described in the link:#brick-simple-add[brick_simple:add/6] section; potential returns to a primitive replace operation are the same as those described in the link:#brick-simple-replace[brick_simple:replace/6] section; and likewise for link:#brick-simple-set[set], likewise for link:#brick-simple-rename[rename], link:#brick-simple-get[get], link:#brick-simple-get-many[get_many], and link:#brick-simple-delete[delete].

Error returns

Return type:{txn_fail, [{integer(), do1_res_fail()}]}
  • Operation failed because transaction semantics were used in the do request and one or more primitive operations within the transaction failed. The integer() identifies the failed primitive operation by its position within the request’s OpList. For example, a 2 indicates that the second primitive listed in the request’s OpList failed. Note that this position identifier does not count the txn() specifier at the start of the OpList.
  • do1_res_fail() indicates the type of failure for the failed primitive operation. Possibilities are:
    • {'key_exists', timestamp()}
      • -type timestamp() :: integer()
    • 'key_not_exist'
    • {'ts_error', timestamp()}
    • 'invalid_flag_present'
Return type:'invalid_flag_present'
  • The operation failed because an invalid do_flag() was found in the do request’s OpFlags argument. Note this is a different error than an invalid flag being found within an individual primitive
Return type:'brick_not_available'
  • The operation failed because the chain that is responsible for this key is currently length zero and therefore unavailable
Return type:{{'nodedown',node()},{'gen_server','call',term()}}
  • The operation failed because the server brick handling the request has crashed or else a network partition has occurred between the client and server. The client should resend the query after a short delay, on the assumption that the Admin Server will have detected the failure and taken steps to repair the chain
  • -type node() :: atom()
Examples

Successful do operation adding two new keys to table tab1, without transaction semantics:

> brick_simple:do(tab1, [brick_server:make_add("foo3", "bar3"),
                         brick_server:make_add("foo4", "bar4")]).
[ok,ok]

Successful creation of two get primitives Do1 and Do2, and their subsequent combination into a do request, without transaction semantics:

> Do1 = brick_server:make_get("foo").
{get,<<"foo">>,[]}
> Do2 = brick_server:make_get("foo2").
{get,<<"foo2">>,[]}
> brick_simple:do(tab1, [Do1, Do2]).
[{ok,1271543102911775,<<"Foo again?">>},key_not_exist]

Failed operation with transaction semantics. Because transaction semantics are used, the failure of the primitive Do2b causes the entire operation to fail:

> Do1b = brick_server:make_get("foo").
{get,<<"foo">>,[]}
> Do2b = brick_server:make_get("foo2", [must_exist]).
{get,<<"foo2">>,[must_exist]}
> brick_simple:do(tab1, [brick_server:make_txn(), Do1b, Do2b]).
{txn_fail,[{2,key_not_exist}]}

brick_simple:fold_table/7

Attempt a fold operation across all keys in a table. For general information about the Erlang fold function that underlies this operation, see http://www.erlang.org/doc/man/lists.html#foldl-3.

Important

Do not execute this operation while a data migration is being performed

brick_simple:fold_table(Table, Fun, Acc, NumItems, Flags)
brick_simple:fold_table(Table, Fun, Acc, NumItems, Flags, MaxParallel)
brick_simple:fold_table(Table, Fun, Acc, NumItems, Flags, MaxParallel, Timeout)
Parameters:
  • Table (table()) –

    Name of the table across which to perform the fold operation

    • -type table() :: atom()
  • Fun (fun_arity_2()) –

    Function to apply to successive elements of the list

    • -type fun_arity_2() :: fun(({ChainName, TupleFromGetMany}, Acc) -> Acc)
      • TupleFromGetMany is a single result tuple from a link:#brick-simple-get-many[brick_simple:get_many()] result. Its format can vary according to the Flags argument, which is passed as-is to a get_many() call. For example, if Flags = [], then TupleFromGetMany will match {Key, TS, Value}. If Flags = [witness], then TupleFromGetMany will match {Key, TS}
    • Acc
      • The accumulator term
  • Acc (term()) – Initial value of the accumulator term
  • NumItems (integer()) – Batch size used for get_many operations used by the fold function
  • Flags (flags_list()) –
    • List of operational flags to apply to the fold_table operation. The supported flags are the same as those for link:#brick-simple-get-many[brick_simple:get_many()]
    • -type flags_list() :: [do_op_flag() | property()]
    • -type do_op_flag() :: 'get_all_attribs' | 'witness' | {'binary_prefix', binary()} | {'max_bytes', integer()}
    • -type property() :: atom() | {term(), term()}
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of each key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blobs in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'binary_prefix', binary()}
        • Return only keys that have a binary prefix that is exactly equal to binary()
      • {'max_bytes', integer()}
        • Return only as many keys as possible such that the sum of the sizes of their corresponding value blobs does not exceed integer() bytes
  • MaxParallel (integer()) –
    • If MaxParallel = 0, a true fold will be performed. If MaxParallel >= 1, then an independent fold will be performed on each chain, with up to MaxParallel number of folds running in parallel. The result from each chain fold will be returned to the caller as-is, i.e. will not be combined like in a “reduce” phase of a map-reduce cycle
    • Optional; defaults to 0
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{ok, Acc::term(), Iterations::integer()}

Error return

Return type:{error, Error::term(), Acc::term(), Iterations::integer()}
Examples

to be added
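
Until then, here is a minimal sketch (an illustration only, not taken from the official examples): it assumes a table tab1 exists, leaves Flags empty so that each TupleFromGetMany matches {Key, TS, Value}, and simply counts every key in the table.

%% Count all keys; the accumulator starts at 0 and NumItems = 100 is the get_many batch size.
> CountFun = fun({_ChainName, {_Key, _TS, _Value}}, Acc) -> Acc + 1 end.
> brick_simple:fold_table(tab1, CountFun, 0, 100, []).

On success the result has the form {ok, KeyCount, Iterations}; with MaxParallel left at its default of 0, a single true fold is performed.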

brick_simple:fold_key_prefix/9

For a binary key prefix Prefix, fold over all keys in table Table starting with StartKey, sleeping for SleepTime milliseconds between iterations and using Flags and NumItems as arguments to link:#brick-simple-get-many[brick_simple:get_many()]. For general information about the Erlang fold function that underlies this operation, see http://www.erlang.org/doc/man/lists.html#foldl-3.

Important

Do not execute this operation while a data migration is being performed

brick_simple:fold_key_prefix(Table, Prefix, Fun, Acc, Flags)
brick_simple:fold_key_prefix(Table, Prefix, StartKey, Fun, Acc, Flags, NumItems, SleepTime, Timeout)
Parameters:
  • Table (table()) –

    Name of the table in which to perform the fold operation

    • -type table() :: atom()
  • Prefix (binary()) – Key prefix for which to perform the fold operation
  • StartKey (binary()) –
    • Key at which to initiate the fold operation
    • Optional; defaults to your specified Prefix
  • Fun (fun_arity_2()) –

    Function to apply to successive elements of the list

    • -type fun_arity_2() :: fun(({ChainName, TupleFromGetMany}, Acc) -> Acc)
      • TupleFromGetMany is a single result tuple from a link:#brick-simple-get-many[brick_simple:get_many()] result. Its format can vary according to the Flags argument, which is passed as-is to a get_many() call. For example, if Flags = [], then TupleFromGetMany will match {Key, TS, Value}. If Flags = [witness], then TupleFromGetMany will match {Key, TS}
    • Acc
      • The accumulator term
  • Acc (term()) – Initial value of the accumulator term
  • Flags (flags_list()) –
    • List of operational flags to apply to the fold_key_prefix operation. The supported flags are the same as those for link:#brick-simple-get-many[brick_simple:get_many()], excluding the {'binary_prefix', binary()} flag. This flag is inappropriate since the key prefix is passed directly through the Prefix argument of brick_simple:fold_key_prefix()
    • -type flags_list() :: ['get_all_attribs' | 'witness' | {'max_bytes', integer()}]
    • Operational flag usage
      • 'get_all_attribs'
        • Return all attributes of each key. May be used in combination with the witness flag
      • 'witness'
        • Do not return the value blobs in the result. This flag will guarantee that the brick does not require disk access to satisfy this request
      • {'max_bytes', integer()}
        • Return only as many keys as possible such that the sum of the sizes of their corresponding value blobs does not exceed integer() bytes
  • NumItems (integer()) – Batch size used for get_many operations used by the fold function
  • SleepTime (integer()) –
    • Sleep time between iterations, in milliseconds
    • Optional; defaults to 0
  • Timeout (timeout()) –
    • Operation timeout in milliseconds
    • Optional; defaults to 15000
    • -type timeout() :: integer() | 'infinity'

Success return

Return type:{ok, Acc::term(), Iterations::integer()}

Error return

Return type:{error, Error::term(), Acc::term(), Iterations::integer()}
Examples

to be added
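
Until then, here is a minimal sketch (an illustration only, not taken from the official examples): it assumes table tab1 holds keys beginning with the binary prefix <<"/user/">> and uses the witness flag, so each TupleFromGetMany matches {Key, TS} and no value blobs are returned.

%% Collect every key under the prefix into a list (in reverse fold order).
> KeyFun = fun({_ChainName, {Key, _TS}}, Acc) -> [Key | Acc] end.
> brick_simple:fold_key_prefix(tab1, <<"/user/">>, KeyFun, [], [witness]).

On success the result has the form {ok, Keys, Iterations}.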

Client API: UBF

link:http://github.com/ubf/ubf[The UBF protocol] is a formally-specified family of protocols that are supported by a large number of client languages. This section attempts to describe the layers of the UBF protocol stack, how to use the UBF client in Erlang and other languages, and how to use that client to access a Hibari storage cluster.

The Hibari source distribution includes UBF/EBF protocol support for the following languages:

  • Erlang, see xref:using-ubf-erlang-client[]
  • Java, see xref:using-ubf-java-client[]
  • Python, see xref:using-ubf-python-client[]

[[hibari-server-impl-of-ubf-proto-stack]]

The Hibari Server’s Implementation of the UBF Protocol Stack

UBF(A): Bottom Layer, transport and session protocol layer

This layer plays the same basic role as many other serialized data transport protocols that use TCP for host-to-host transport, such as link:http://en.wikipedia.org/wiki/Open_Network_Computing_Remote_Procedure_Call[ONC-RPC], link:http://en.wikipedia.org/wiki/IIOP[CORBA IIOP], link:http://en.wikipedia.org/wiki/Protocol_buffers[Protocol Buffers], and link:http://en.wikipedia.org/wiki/Thrift_(protocol)[Thrift].

Hibari servers support several of these session protocols on top of a TCP/IP transport protocol. The choice of session protocol is a matter of convenience and/or support for the application developer. It should be as easy for an app developer to use Hibari with Ruby and JSON-RPC as it is with Python and Thrift or EBF.

  • UBF(A), Joe Armstrong’s original session layer protocol
  • EBF, the Erlang Binary Format. The session layer protocol is a thin, efficient protocol that uses the Erlang BIFs term_to_binary() and binary_to_term() to serialize Erlang data terms. This protocol is very closely related to the link:http://bert-rpc.org/[BERT protocol].
  • JSON over TCP, also called JSF (the JavaScript Format). Erlang terms are encoded as link:http://en.wikipedia.org/wiki/JSON[JSON terms] and transmitted directly over a TCP transport. This protocol is not in common use but is easy to implement in the UBF server framework.
  • HTTP, the link:http://en.wikipedia.org/wiki/HTTP[Hypertext Transfer Protocol]. This protocol is used to support Hibari’s link:http://en.wikipedia.org/wiki/JSON-RPC[JSON-RPC] server.
  • link:http://en.wikipedia.org/wiki/Thrift_(protocol)[Thrift]. Similar to EBF, except that Thrift’s binary encoding is used for the wire protocol instead of UBF(A) or Erlang’s native wire formats.
  • link:http://en.wikipedia.org/wiki/Protocol_buffers[Protocol Buffers]. Similar to EBF, except that Google’s Protocol Buffers binary encoding is used for the wire protocol instead of UBF(A) or Erlang’s native wire formats. Hibari support is experimental (i.e. not yet implemented).
  • link:http://hadoop.apache.org/avro/docs/current/[Avro]. Similar to EBF, except that Avro’s binary encoding is used for the wire protocol instead of UBF(A) or Erlang’s native wire formats. Hibari support is experimental (i.e. not yet implemented).
UBF(B): Middle Layer, the “contract”

UBF(B) is a programming language for describing types in UBF(A) and protocols between clients and servers. UBF(B) is roughly equivalent to Verified XML, XML-schemas, SOAP and WSDL.

This layer enforces a protocol “contract”, a formal specification of all data sent by the client and by the server. Any data that does not precisely conform to the protocol is rejected by the contract checker (which is embedded in the server). If the client wishes, it may also use the contract checker to validate data sent by the server, though this is not commonly done.

UBF(C): Top Layer, the UBF Metaprotocol

The metaprotocol is used at the beginning of a UBF session to select one of the UBF(B) contracts that the TCP listener is capable of offering. At the moment, Hibari servers support only the “gdss” contract, but other contracts may be added in the future.

[[ubf-representation-of-strings]]

UBF representation of strings vs. binaries

The Erlang language does not have a data type specifically for strings. Instead, strings are typically represented as lists of integers (ASCII byte values) and/or binaries.

A UBF contract makes a distinction between a string, a list, and a binary. For a string, UBF(A) uses the notation {'#S', "Hello, world!"} to represent the string “Hello, world!”.

This string encoding is cumbersome to use for developers; in Erlang, the ubf.hrl header file includes a macro ?S("Hello, world!") as a slightly less ugly shortcut. When using other languages, the 2-tuple and the atom '#S' would be created as any other 2-tuple and atom.

Fortunately, there is only one case where the string type is necessary: using the startSession metaprotocol command to start using the Hibari data server contract. An example will be shown below.
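
As a small illustration (written as module code rather than shell input, since macros cannot be used in the Erlang shell), the macro form and the explicit 2-tuple are interchangeable; the module and function names here are hypothetical:

-module(ubf_string_example).
-include("ubf.hrl").
-export([select_gdss_contract/1]).

%% ?S("gdss") expands to {'#S', "gdss"}, the UBF string form, so this call
%% sends exactly the same startSession request as writing the tuple by hand.
select_gdss_contract(Pid) ->
    ubf_client:rpc(Pid, {startSession, ?S("gdss"), []}).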

[[using-ubf-in-any-language]]

Steps for Using a UBF-based Protocol in Any Language

The steps to use a UBF-based protocol are the same in any language.

  1. Create a connection to the UBF server.
    • ... or the EBF server, or the JSON-RPC server, or the Thrift server, or the ....
  2. Use the UBF metaprotocol to start using the gdss contract, i.e. the Hibari server contract.
  3. Send one or more Hibari server queries and decode the respective server responses.
  4. Close the connection to the UBF server.

[[the-hibari-ubf-protocol-contract]]

The Hibari UBF Protocol Contract

The Hibari UBF Protocol contract can be found in the file ubf_gdss_plugin.con.

Note

See the Hibari source code for the most up-to-date version of this file. link:./misc-codes/ubf_gdss_plugin.con[This documentation has a copy of ubf_gdss_plugin.con], though it may be slightly out-of-date.

The names of the UBF types specified in the contract may differ slightly from the names of the types used in this document’s xref:client-api-erlang[]. For example, the UBF contract calls the key expiration time exp_time(), while the type system in this document calls it expiry(). However, in all cases of slightly different names, the fundamental data type that both names use is the same: e.g. integer() for expiration time.

For each command, the UBF contract uses the following naming conventions:

  • CommandName_req() for the request sent from client -> server, e.g. set_req() for the set command.
  • CommandName_res() for the response sent from server -> client, e.g. set_res() for the set response.

The general form of a UBF RPC call is a tuple. The first element in the tuple is the name of the command, and the following elements are arguments for that command. The response can be any Erlang term, but the Hibari contract will only return the atom or tuple types.

The following is a mapping of each UBF client request type to its Erlang API function, in alphabetical order:

  • add_req() -> brick_simple:add(), see xref:brick-simple-add[].
  • delete_req() -> brick_simple:delete(), see xref:brick-simple-delete[].
  • do_req() -> brick_simple:do(), see xref:brick-simple-do[].
  • get_req() -> brick_simple:get(), see xref:brick-simple-get[].
  • get_many_req() -> brick_simple:get_many(), see xref:brick-simple-get-many[].
  • rename_req() -> brick_simple:rename(), see xref:brick-simple-rename[].
  • replace_req() -> brick_simple:replace(), see xref:brick-simple-replace[].
  • set_req() -> brick_simple:set(), see xref:brick-simple-set[].

[[using-ubf-erlang-client]]

Using the UBF Client Library for Erlang

Important

  1. When using the Erlang shell for experimentation & prototyping, that shell must have the path to the Erlang UBF client library in its search path. The easiest way to do this is to pass the arguments -pz /path/to/ubf/library/ebin to your Erlang shell’s erl command.
  2. When writing code, add the statement -include("ubf.hrl"). at the top of your source module to gain access to the ?S() macro. Due to limitations in the Erlang shell, macros cannot be used in the shell.

As outlined in xref:using-ubf-in-any-language[], the first step is to create a connection to a Hibari server. If the Hibari cluster has multiple nodes, then it doesn’t matter which one you connect to: all nodes can handle any UBF request and will route the query to the proper brick.

  1. Create a connection to the UBF server (on “localhost” TCP port 7581):

    (asdf@bb3)54> {ok, P1, _} = ubf_client:connect("localhost", 7581, [{proto, ubf}], 5000).
    {ok,<0.139.0>,{'#S', "gdss_meta_server"}}
    

    The second step is to use the UBF metaprotocol to select the Hibari server contract, called “gdss”, for all further commands for this connection.

    Tip

    The Hibari server contract is “stateless”. All reply terms from the ubf_client:rpc/2 function use the form {reply,ServerReply,UBF_StateName}. Because the Hibari server contract is stateless, the UBF_StateName will always be the atom none.

  2. Use the UBF metaprotocol to request the “gdss” contract:

    (asdf@bb3)55> ubf_client:rpc(P1, {startSession, {'#S', "gdss"}, []}).
    {reply,{ok,ok},none}
    

    Now that the UBF connection is set up, we can use it to set a key “foo”.

  3. Set the key “foo” in table tab1 with the value “foo val”, no expiration time, no flags, and a timeout of 5 seconds:

    (asdf@bb3)59> ubf_client:rpc(P1, {set, tab1, <<"foo">>, <<"foo val">>, 0, [], 5000}).
    {reply,ok,none}
    

    Note

    Note that the return values of both set_req() (in the example above) and get_req() (in the example below) are the same types described in xref:brick-simple-set[] and xref:brick-simple-get[], respectively.

    The only difference is that the ubf_client:rpc/2 function wraps the server’s reply in a 3-tuple: {reply,ServerReply,none}.

  4. Get the key “foo” in table tab1, timeout in 5 seconds:

    (asdf@bb3)66> ubf_client:rpc(P1, {get, tab1, <<"foo">>, [], 5000}).
    {reply,{ok,1273009092549799,<<"foo val">>},none}
    

    If the client sends a request that violates the contract, the server will tell you, as in this example.

  5. Send a contract-violating request:

    (asdf@bb3)89> ubf_client:rpc(P1, {bbb, 3000}).
    {reply,{clientBrokeContract,{bbb,3000},[]},none}
    

    When you are done with the connection, it is polite to close the connection explicitly. The server will quietly clean up its side of the connection if the client forgets to call or cannot call stop/1.

  6. Close the UBF connection:

    (asdf@bb3)92> ubf_client:stop(P1).
    ok
    

[[using-ubf-java-client]]

Using the UBF Client Library for Java

The source code for the UBF client library for Java is included in the UBF source repository at link:http://github.com/ubf/ubf[http://github.com/ubf/ubf], in the priv/java subdirectory.

Compiling the UBF client library for Java
  1. Please update your UBF client library code to the “master” branch for a date after 10 May 2010, or use the Git tag “v1.14” or later. Versions of the library before 10 May 2010 and tag “v1.14” have several bugs that will prevent the UBF client from working correctly.
  2. Change directory to the priv/java directory of the UBF client library source distribution.
  3. Run make.
  4. (Optional) Copy the class files in the classes subdirectory to a suitable directory for your Java development environment.
Compiling the UBF client library test program HibariTest.java
  1. Change directory to the gdss-ubf-proto/priv/java subdirectory in the Hibari source distribution.

  2. Edit the Makefile to change the UBF_CLASSES_DIR variable to point to the priv/java/classes subdirectory of the UBF package’s source code (or the subdirectory where those classes have been formally installed on your system).

  3. Run the following two make commands. The second assumes that the Hibari server’s UBF server is on the local machine, “localhost”:

    $ make HibariTest
    $ make run-HibariTest
    
  4. If the Hibari server is not running on the local machine, then run make -n run-HibariTest to show the java command that is used to run the test program. Cut-and-paste the command into your shell, then edit the last argument to specify the hostname of a Hibari server.

Examining the HibariTest.java test program

The main() function does three things:

  1. Create a new UBF connection to a Hibari server (the hostname/IP address is specified in the first command line argument) and request the gdss contract via the UBF metaprotocol.
  2. Run the small test cases in the test_hibari_basics() method.
  3. Close the UBF session and exit.
The ubf.HibariTest.main() method
public class HibariTest {

    public static void main(String[] args) throws Exception {
        Socket sock = null;
        UBFClient ubf = null;

        try {
            sock = new Socket(args[0], 7581);
            ubf = UBFClient.new_via_sock(new UBFString("gdss"), new UBFList(),
                    new FooHandler(), sock);
        } catch (Exception e) {
            System.out.println(e);
            System.exit(1);
        }

        test_hibari_basics(ubf);

        ubf.stopSession();
        System.out.println("Success, it works");
        System.exit(0);
    }
    /* ... */
 }

The test_hibari_basics() method performs the same basic UBF operations as the Python EBF demonstration script described in xref:using-ubf-python-client[]. Unlike the Python demo script, the demo program does not use the Hibari do() command but rather the single-operation commands like get() and set().

  1. Delete the key foo from table tab1:

    public static void test_hibari_basics(UBFClient ubf)
            throws IOException, UBFException {
        // setup
        UBFObject res1 = ubf.rpc(
               UBF.tuple( new UBFAtom("delete"), new UBFAtom("tab1"),
                          new UBFBinary("foo"), new UBFList(),
                          new UBFInteger(4000)));
        System.out.println("Res 1:" + res1.toString());
    
  2. Add the key foo to table tab1:

    // add - ok
    UBFObject res2 = ubf.rpc(
            UBF.tuple( new UBFAtom("add"), atom_tab1,
                        new UBFBinary("foo"), new UBFBinary("bar"),
                        new UBFInteger(0), new UBFList(),
                        new UBFInteger(4000)));
    System.out.println("Res 2:" + res2.toString());
    if (! res2.equals(atom_ok))
        System.exit(1);
    
  3. Add the key foo to table tab1 again, this time expecting a failure:

    // add - ng
    UBFObject res3 = ubf.rpc(
            UBF.tuple( new UBFAtom("add"), atom_tab1,
                       new UBFBinary("foo"), new UBFBinary("bar"),
                       new UBFInteger(0), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 3:" + res3.toString());
    if (! ((UBFTuple)res3).value[0].equals(atom_key_exists))
        System.exit(1);
    
  4. Get the key foo from table tab1:

    // get - ok
    UBFObject res4 = ubf.rpc(
            UBF.tuple( new UBFAtom("get"), atom_tab1,
                       new UBFBinary("foo"), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 4:" + res4.toString());
    if (! ((UBFTuple)res4).value[0].equals(atom_ok) ||
        ! ((UBFTuple)res4).value[2].equals("bar"))
        System.exit(1);
    
  5. Set the key foo in table tab1 to bar bar:

    // set - ok
    UBFObject res5 = ubf.rpc(
            UBF.tuple( new UBFAtom("set"), atom_tab1,
                       new UBFBinary("foo"), new UBFBinary("bar bar"),
                       new UBFInteger(0), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 5:" + res5.toString());
    if (! res5.equals(atom_ok))
        System.exit(1);
    
  6. Get foo again and verify that the value is bar bar:

    // get - ok
    UBFObject res6 = ubf.rpc(
            UBF.tuple( new UBFAtom("get"), atom_tab1,
                       new UBFBinary("foo"), new UBFList(),
                       new UBFInteger(4000)));
    System.out.println("Res 6:" + res6.toString());
    if (! ((UBFTuple)res6).value[0].equals(atom_ok) ||
        ! ((UBFTuple)res6).value[2].equals("bar bar"))
        System.exit(1);
    
The UBF event handler interface

Each UBFClient instance uses a separate thread to read data from the server and do any of the following:

  1. Signal to the other thread that a synchronous RPC response was received from the server.
  2. Run a callback function when an event_out asynchronous event is received from the server.
  3. Handle the case where the socket is closed unexpectedly.

In cases #2 and #3, a class that implements the UBFEventHandler interface is used to define the action to be taken in those cases.

HibariTest.java contains a sample implementation of callback functions for asynchronous events. A real application would probably want to do something much more helpful than this example does.

public static class FooHandler implements UBFEventHandler {
    public FooHandler() {
    }
    public void handleEvent(UBFClient client, UBFObject event) {
        System.out.println("Hey, got an event: " + event.toString());
    }
    public void connectionClosed(UBFClient client) {
        System.out.println("Hey, connection closed, ignoring it\n");
    }
}

Tip

See xref:the-ubf-hibaritest-main-method[] for an example that uses this FooHandler class.

[[using-ubf-python-client]]

Using the EBF Client Library for Python

The source code for the EBF client library for Python is included in the UBF source repository at link:http://github.com/ubf/ubf[http://github.com/ubf/ubf], in the priv/python subdirectory.

NOTE: Recall that the EBF protocol is very closely related to UBF. The only significant difference is the “layer 5” session protocol layer: instead of the UBF(A) protocol, the EBF (Erlang Binary Format) protocol is used. See xref:hibari-server-impl-of-ubf-proto-stack[] for more details.

In addition, you will need the “py_interface” package, developed by Tomas Abrahamsson and others. “py-interface” is distributed under the link:http://www.fsf.org/licensing/education/licenses/lgpl.html[GNU Library General Public License]. A git repository is hosted at repo.or.cz. To clone it and build it, use:

$ git clone git://repo.or.cz/py_interface.git
$ cd py_interface
$ autoconf
$ ./configure
$ make
$ pwd

Use the output of the last command, pwd, to remember the full directory path to the “py-interface” library. The example below uses the placeholder /path/to/py_interface for that path.

The pyebf.py file contains a small unit test that makes several calls to the Hibari UBF contract’s do_req() command. The results of (almost) every command are verified using the assert function.

env PYTHONPATH=/path/to/py_interface python pyebf.py
  1. Connect to the Hibari server on “localhost” TCP port 7580 and use the UBF metaprotocol to switch to the gdss contract:

    ## login
    ebf.login('gdss', 'gdss_meta_server')
    
  2. Delete the key 'foo' from table tab1:

    ## setup
    req0 = (Atom('do'), Atom('tab1'), [(Atom('delete'), 'foo', [])], [], 1000)
    res0 = ebf.rpc('gdss', req0)
    
  3. Get the key 'foo' from table tab1:

    ## get - ng
    req1 = (Atom('do'), Atom('tab1'), [(Atom('get'), 'foo', [])], [], 1000)
    res1 = ebf.rpc('gdss', req1)
    assert res1[0] == 'key_not_exist'
    
  4. Add the key 'foo' to table tab1. The do_req() interface requires the client to manage timestamp integers explicitly; the timestamp 1 is used here:

    ## add - ok
    req2 = (Atom('do'), Atom('tab1'),
            [(Atom('add'), 'foo', 1, 'bar', 0, [])], [], 1000)
    res2 = ebf.rpc('gdss', req2)
    assert res2[0] == 'ok'
    
  5. Add the key 'foo' to table tab1:

    ## add - ng
    req3 = (Atom('do'), Atom('tab1'),
            [(Atom('add'), 'foo', 1, 'bar', 0, [])], [], 1000)
    res3 = ebf.rpc('gdss', req3)
    assert res3[0][0] == 'key_exists'
    assert res3[0][1] == 1
    
  6. Get the key 'foo' from table tab1, verifying that the timestamp is still 1 and value is still 'bar':

    ## get - ok
    req4 = (Atom('do'), Atom('tab1'), [(Atom('get'), 'foo', [])], [], 1000)
    res4 = ebf.rpc('gdss', req4)
    assert res4[0][0] == 'ok'
    assert res4[0][1] == 1
    assert res4[0][2] == 'bar'
    
  7. Set the key 'foo' in table tab1, using a new timestamp 2:

    ## set - ok
    req5 = (Atom('do'), Atom('tab1'),
            [(Atom('set'), 'foo', 2, 'baz', 0, [])], [], 1000)
    res5 = ebf.rpc('gdss', req5)
    assert res5[0] == 'ok'
    
  8. Get the key 'foo' from table tab1, verifying both the new timestamp and new value:

    ## get - ok
    req6 = (Atom('do'), Atom('tab1'), [(Atom('get'), 'foo', [])], [], 1000)
    res6 = ebf.rpc('gdss', req6)
    assert res6[0][0] == 'ok'
    assert res6[0][1] == 2
    assert res6[0][2] == 'baz'
    

Client API: Thrift

“TBF” is a link:https://github.com/apache/thrift[Thrift protocol] defined by the UBF contract (see xref:the-hibari-ubf-protocol-contract[]). This section attempts to describe the Hibari Thrift API, which allows users to access Hibari with Thrift clients in any Thrift-supported programming language, and how to extend the API for application use.

The Hibari Thrift API

The Hibari Thrift API is defined as the Hibari Service in link:./misc-codes/hibari.thrift[hibari.thrift]. At the time this API was developed, only Thrift 0.4.0 was available to us. This version is our first attempt to adopt Thrift. Some of the functions and options are not yet supported.

Important

The Hibari Thrift API only supports Thrift 0.4.0 or above.

service Hibari {

   /**
    * Check connection availability / keepalive
    */
   oneway void keepalive()

   /**
    * Hibari Server Info
    */
   string info()

   /**
    * Hibari Description
    */
   string description()

   /**
    * Hibari Contract
    */
   string contract()

   /**
    * Add
    */
   HibariResponse Add(1: Add request)
       throws (1:HibariException ouch)

   /**
    * Replace
    */
   HibariResponse Replace(1: Replace request)
       throws (1:HibariException ouch)

   /**
    * Set
    */
   HibariResponse Set(1: Set request)
       throws (1:HibariException ouch)

   /**
    * Rename
    */
   HibariResponse Rename(1: Rename request)
       throws (1:HibariException ouch)

   /**
    * Delete
    */
   HibariResponse Delete(1: Delete request)
       throws (1:HibariException ouch)

   /**
    * Get
    */
   HibariResponse Get(1: Get request)
       throws (1:HibariException ouch)
   }

Each primitive utility function has exactly one input parameter. The parameter is an object that has a name matching its function. The object carries all mandatory and optional parameters to Hibari. This object could also be used to implement micro-transactions in the future.

Mapping UBF Contract Types to Thrift Types

You can find more details of the UBF / Thrift type conversion in (link:https://github.com/ubf/ubf-thrift[UBF-Thrift]).

Mapping UBF Contract to Thrift Service

Mapping UBF types to Thrift primitives is different from mapping UBF contracts to a service. Thrift mainly uses two different types to compose a request (struct and field).

If you are using Thrift to generate client code, you probably don’t need to worry about how the request is constructed. Visit the link:http://wiki.apache.org/thrift/ThriftGeneration[Thrift Wiki] for instructions on installing Thrift and generating client code. You will also need link:./misc-codes/hibari.thrift[hibari.thrift] to get started.

If you are interested in the UBF contract, the Hibari NTBF contract can be found in the file ntbf_gdss_plugin.con.

Examples of using a Thrift client

Once you have the generated code, connecting to Hibari is easy. For example, the following shows how to add the key 'fookey' to table tab1 with a value of 'Hello, world!' in three languages.

In Erlang:

-include("hibari_thrift.hrl").

% init
{ok, Client} = thrift_client:start_link("127.0.0.1", 7600, hibari_thrift),

% create the input parameter object
Request = #add{table=<<"tab1">>, key=<<"fookey">>, value=<<"Hello, world!">>},

% send request
try
  HibariResponse = thrift_client:call(Client, 'Add', [Request])
catch
  HibariException ->
    HibariException
end,

ok = thrift_client:close(Client).

In Java:

import com.hibari.rpc.*;

// init
TTransport transport = new TSocket("127.0.0.1", 7600);
TProtocol proto = new TBinaryProtocol(transport);
Hibari.Client client = new Hibari.Client(proto);
transport.open();

// create the input parameter object
Add request = new Add("tab1", ByteBuffer.wrap("fookey".getBytes()),
  ByteBuffer.wrap("Hello, world!".getBytes()));

// send request
try {
  HibariResponse response = client.Add(request);
} catch (HibariException e) {
  // ...
}

transport.close();

In python:

from hibari import Hibari
from hibari.ttypes import Add  # generated request struct (assumed module path)
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# init
transport = TSocket.TSocket('localhost', 7600)
transport.setTimeout(None)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hibari.Client(protocol)
transport.open()

# create the input parameter object
request = Add()
request.table = "tab1"
request.key = b"fookey"
request.value = b"Hello, world!"

# send request
response = client.Add(request)

transport.close()

Mapping TBF Contract Responses From Thrift Client

TBF returns only one of two generic types for all functions in the Hibari Thrift API: HibariResponse or HibariException. One can expect a HibariResponse in any successful case. Otherwise, a HibariException is thrown.

Building Hibari from Source

This section describes the basic recipes to build the following items:

  • Hibari Release Package
  • Hibari Documentation
  • Erlang/OTP System

Required Third Party Software

Before getting started, review this checklist of tools and software. Please install and set up as needed.

Mandatory Items (Required for Building Hibari)

The following software is required in order to download Hibari and build a release package:

  • Git – http://git-scm.com/

    • Must be version 1.5.4 or newer.

      • 1.7.3.4 is the version most recently tested for Hibari.
    • If you haven’t yet done so, please configure your email address and name for Git:

      $ git config --global user.email "you@example.com"
      $ git config --global user.name "Your Name"
      
    • If you haven’t yet done so, you must sign up for a GitHub account – https://github.com/

      • Anonymous read-only access using the GIT protocol is the default.
      • Team members with read-write access: be sure to add your SSH public key under your GitHub account.
  • Python – http://www.python.org

    • Required by Repo

    • Must be version 2.4 or newer

      • 2.7 is the version most recently tested for Hibari.

      Caution

      Python 3.x might be too new.

  • Repo – http://source.android.com/source/git-repo.html

    • Install as follows:

      $ mkdir -p ~/bin
      $ curl http://commondatastorage.googleapis.com/git-repo-downloads/repo > ~/bin/repo
      $ chmod a+x ~/bin/repo
      
    • The downloading and packaging process also uses Rebar (https://github.com/basho/rebar/wiki) but this tool is included in the Hibari Git repositories so you do not need to install it separately.

  • OpenSSL – http://www.openssl.org/

    • Required for Erlang’s crypto module.
  • Erlang/OTP – http://www.erlang.org/

    • Must be version R16B01 or newer.
      • 17.4 is the version most recently tested for Hibari.
    • For information on building Erlang/OTP from source, see <<ErlangOTP>> in this document.
Optional Items (Required for Building Hibari’s Documentation)

The following software is required only if you want to build Hibari’s documentation from source. Note that an online version of the documentation is available at http://hibari.github.com/hibari-doc/.

Downloading Hibari

Follow these steps to download the Hibari repositories from GitHub.

  1. Create a working directory and retrieve the Hibari manifest files:

    $ mkdir working-directory
    $ cd working-directory
    $ repo init -u git://github.com/hibari/manifests.git -m hibari-default.xml
    

    Note

    Your “Git” identity is needed during the repo init step. Please enter the name and email of your GitHub account if you have one. Team members having read-write access should use repo init -u git@github.com:hibari/manifests.git -m hibari-default-rw.xml.

    Tip

    If you want to check out the latest development version of Hibari, please append -b dev to the repo init command.

  2. Download Hibari’s Git repositories:

    $ repo sync
    

    After the repo sync, your working directory has the following structure:

    <working-directory>
     |- hibari/
       |- .git/
       |- .gitignore
       |- Makefile
       |- dialyze-ignore-warnings.txt
       |- dialyze-nospec-ignore-warnings.txt
       |- lib/                             <1>
         |- <application_name>/
           |- .git/
           |- .gitignore
           |- ebin/
           |- include/
             |- *.hrl
           |- priv/
           |- rebar.config
           |- src/
             |- <application_name>.app.src
             |- *.erl
           |- test/
             |- eunit/
               |- *.erl
             |- eqc/
               |- *.erl
         :
       |- rebar
       |- rebar.config
       |- rel/                             <2>
         |- files/
           |- app.config
           |- erl
           |- hibari
           |- hibari-admin
           |- nodetool
           |- nodetool-admin
           |- vm.args
         |- hibari/
           :
           |- releases/
             |- <release_vsn>/
               :
             :
           :
         |- reltool.config
     |- hibari-doc/                        <3>
       :
     |- manifests/                         <4>
       :
     |- patches/                           <5>
       :
     |- rebar/                             <6>
       :
     |- .repo/
       :
    

<1> Applications <2> Releases <3> Documentation <4> Manifests <5> Patches <6> Rebar

Building the Hibari Release Package

Follow these steps to build a Hibari release package.

  1. Building basic recipe:

    $ cd working-directory/hibari
    $ make
    

Tip

If the response is “make: erl: Command not found”, please make sure Erlang/OTP is installed and “otp-installing-directory-name/bin” is added to your $PATH environment variable.

  2. Release packaging basic recipe:

    $ cd working-directory/hibari
    $ make package
    

Note

A release package tarball “hibari-X.Y.Z-dev-ARCH-WORDSIZE.tgz” and md5sum file “hibari-X.Y.Z-dev-ARCH-WORDSIZE-md5sum.txt” are written into your working-directory. You can then use these files to perform a single-node or multi-node Hibari installation as described in <<getting-started>>.

[[HibariAsciiDoc]]

Building Hibari’s Documentation

Follow these steps to build Hibari’s documentation.

  1. Building Hibari’s “Guides” basic recipe:

    $ cd working-directory/hibari-doc/src/hibari
    $ make clean -OR- make realclean
    $ make
    
  2. Building Hibari’s “Website” basic recipe:

    $ cd working-directory/hibari-doc/src/hibari/website
    $ make clean -OR- make realclean
    $ make
    

Note

HTML documentation is written in the “./public_html” directory.

Hibari’s documentation is authored using AsciiDoc and a few auxiliary tools:

  • ImageMagick
  • dblatex
  • Dia
  • Graphviz
  • Mscgen
  • w3m

Hibari’s documentation is generated with AsciiDoc and a manually modified version of the a2x tool. A fake lang-ja.conf file can be easily created by making a symlink to the lang-en.conf file.

diff -r -u 8.6.4-orig/bin/a2x.py 8.6.4/bin/a2x.py
--- 8.6.4-orig/bin/a2x.py    2011-04-24 00:50:26.000000000 +0900
+++ 8.6.4/bin/a2x.py 2011-04-24 00:35:55.000000000 +0900
@@ -156,7 +156,10 @@
  def shell_copy(src, dst):
    verbose('copying "%s" to "%s"' % (src,dst))
      if not OPTIONS.dry_run:
-        shutil.copy(src, dst)
+        try:
+            shutil.copy(src, dst)
+        except shutil.Error:
+            return

  def shell_rm(path):
      if not os.path.exists(path):
 Only in 8.6.4/etc/asciidoc: lang-ja.conf

[[ErlangOTP]]

Building and Installing Erlang/OTP

Follow these steps to download and build Erlang/OTP from source, and to install the system. These steps provide a basic recipe; not all options are addressed.

Note

Please make sure to have the ‘openssl-devel’ package installed on your system before configuring and building Erlang/OTP.

  1. Download the source code for your Erlang/OTP system:

    $ cd working-directory
    $ wget http://www.erlang.org/download/otp_src_R14B01.tar.gz
    
  2. Untar the source code for your Erlang/OTP system:

    $ tar -xzf otp_src_R14B01.tar.gz
    
  3. Configure Erlang/OTP:

    $ cd working-directory/otp_src_R14B01
    $ ./configure --prefix=otp-installing-directory-name
    
  4. Build Erlang/OTP:

    $ make
    
  5. Install Erlang/OTP:

    $ sudo make install
    

Caution

Please make sure “otp-installing-directory-name/bin” is added to your $PATH environment variable.

Contributing to Hibari

GitHub, Git, and Repo

to be added

List the working directories for all of Hibari’s “projects”:

$ repo forall -c "pwd"

Note

Each project has a corresponding Git repository and (default) revision. Check the “manifests/hibari-default.xml” file for details.

Start a new topic (e.g. new-topic-name) branch:

$ repo start new-topic-name `repo forall -c "pwd" | xargs echo`

Abandon an existing topic (e.g. topic-name) branch:

$ repo abandon topic-name `repo forall -c "pwd" | xargs echo`

Track and checkout the master branch:

$ repo forall -c "git branch --track master github/master"
$ repo forall -c "git checkout master"

Track and checkout the dev (i.e. Development) branch:

$ repo forall -c "git branch --track dev github/dev"
$ repo forall -c "git checkout dev"

Code, Branch, and Version Management

to be added

Documentation

to be added

Submitting Patches

to be added

System Administration

Hibari System Administrator’s Guide (Hibari v0.1.11)

DRAFT - IN PROGRESS

Date: 2015/03/22
Revision: 0.5.4

Copyright (C) 2005-2015 Hibari developers. All rights reserved.

Table of Contents

Introduction

Caution

This document is under re-construction – beware!

The Problem

There exists a dichotomy in modern storage products. Commodity storage is inexpensive, but unreliable. Enterprise storage is expensive, but reliable. Large capacities are available in both the enterprise and commodity classes. The problem, then, becomes how to leverage inexpensive commodity hardware to achieve high-capacity, enterprise-class reliability at a fraction of the cost.

This problem space has been researched extensively, especially in the last few years: in academia, in the commercial sector, and in the open source community. Hibari uses techniques and algorithms from this research to create a solution that is reliable, cost-effective, and scalable.

Key-Value Store

Hibari is a key-value store. If a key-value store were represented as an SQL table, it would be defined as:

[[sql-definition-key-value]]

SQL-like definition of a generic key value store
CREATE TABLE foo (
    BLOB key;
    BLOB value;
) PRIMARY KEY key;

In truth, each key stored in Hibari has three additional fields associated with it. See xref:hibari-data-model[] and link:hibari-contributor-guide.en.html[Hibari Contributor’s Guide] for details.

[[hibari-origins]]

Hibari’s Origins

Hibari was originally written by Cloudian, Inc., formerly Gemini Mobile Technologies, to support mobile messaging and email services. Hibari was released outside of Cloudian under the Apache Public License version 2.0 in July 2010.

Hibari has been deployed by multiple telecom carriers in Asia and Europe. Hibari may lack some features such as monitoring, event and alarm management, and other “production environment” support services. Since telecom operator has its own data center support infrastructure, Hibari’s development has not included many services that would be redundant in a carrier environment.

We hope that Hibari’s release to the open source community will close those functional gaps as Hibari spreads outside of carrier data centers.

Summary of Hibari’s Main Features
  • A Hibari cluster is a distributed system.
  • A Hibari cluster is linearly scalable.
  • A Hibari cluster is highly available.
  • All updates are durable.
  • All updates are strongly consistent.
  • All client operations are lockless.
  • A Hibari cluster’s performance is excellent.
  • Multiple client access protocols are available.
  • Data is repaired automatically after a server failure.
  • Cluster configuration can be changed at any time.
  • Data is automatically rebalanced.
  • Heterogeneous hardware support is easy.
  • Micro-transactions simplify creation of robust client applications.
  • Per-table configurable performance options are available.

[[acid-base-hibari]]

The “ACID vs. BASE” Spectrum and Hibari

Important

We strongly believe that “ACID” and “BASE” properties exist on a spectrum and are not exclusively one or the other (black-or-white) properties.

Most database users and administrators are familiar with the acronym ACID: Atomic, Consistent, Isolated, and Durable. Now, consider an alternative method of storing and managing data, BASE:

  • Basically available
  • Soft state
  • Eventually consistent

For an link:http://queue.acm.org/detail.cfm?id=1394128[exploration of ACID and BASE properties (at ACM Queue)], see:

“BASE: An Acid Alternative”, Dan Pritchett, ACM Queue, volume 6, number 3 (May/June 2008), ISSN: 1542-7730, http://queue.acm.org/detail.cfm?id=1394128

When both strict ACID and strict BASE properties are placed on a spectrum, they are at the opposite ends. However, a distributed database system can fit anywhere in the middle of the spectrum.

A Hibari cluster lies near the ACID end of the ACID/BASE spectrum. In general, Hibari’s design will always favor consistency and durability of updates at the expense of 100% availability in all situations.

[[cap-theorem-and-hibari]]

The CAP Theorem and Hibari

Warning

Eric Brewer’s “CAP Theorem”, and its proof by Gilbert and Lynch, is a tricky thing. It’s nearly impossible to cleanly apply the purity of logic to the dirty world of real, industrial computing systems. We strongly suggest that the reader consider the CAP properties as a spectrum, one of balances and trade-offs. The distributed database world is not black and white, and it is important to know where the gray areas are.

See the link:http://en.wikipedia.org/wiki/CAP_theorem[Wikipedia article about the CAP theorem] for a summary of the theorem, its proof, and related links.

CAP Theorem (postulated by Eric Brewer, Inktomi, 2000), Wikipedia, http://en.wikipedia.org/wiki/CAP_theorem

Hibari chooses the C and P of CAP. It utilizes the chain replication technique and always guarantees strong consistency. Hibari also includes an Erlang/OTP application specifically for detecting network partitions, so that when a network partition occurs, the brick nodes on the opposite side of the partition from the active master will be removed from the chains to keep the strong consistency guarantee.

See xref:admin-server-and-network-partition[] for details.

Hibari’s Main Features in Broad Detail

Distributed system

Multiple machines can participate in a single cluster. The maximum size of a Hibari cluster has not yet been determined. A practical limit of approximately 200-250 nodes is likely.

Any server node can handle any client request, forwarding a request to the correct server node when necessary. Clients maintain enough state to send their queries directly to the correct server node in all common cases.

Scalable system

The total storage and processing capacity of a Hibari cluster increases linearly as machines are added to the cluster.

Durable updates

Every key update is written and flushed to stable storage (via the fsync() system call) before sending acknowledgments to the client.

Consistent updates

After a key’s update is acknowledged, no client in the cluster can see an older version of that key. Hibari uses the “chain replication” algorithm to maintain consistency across all replicas of a key.

All data written to disk include MD5 checksums; the checksums are validated on each read to avoid sending corrupted data to the client.

[[lockless-client-api]]

Lockless client API

The Hibari client API requires that all operations (read queries operations and/or update operations) be self-contained within a single client request. Therefore, locks are not implemented because they are not required.

Inside Hibari, each key-value pair also contains a “timestamp” value. A timestamp is an integer. Each time the key is updated, the timestamp value must increase. (This requirement is enforced by all server nodes.)

In many database systems, if a client requires guarantees that a key has not changed since the last time it was read, then the client acquires a lock (or lease) on the key. In Hibari, the client’s update specifies the timestamp of the last read attempt of the key:

  • If the timestamp matches the server’s timestamp, the operation is permitted.
  • If the timestamp does not match the server’s timestamp, then the operation is not permitted, and the new timestamp is returned to the client (see the sketch below).
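
A minimal sketch of this test-and-set pattern, using the brick_simple API described earlier in this document (the table name tab1 and the key are placeholders; argument forms follow the brick_simple sections referenced in the Developer’s Guide):

> {ok, TS, _OldVal} = brick_simple:get(tab1, <<"foo">>, [], 5000).
> brick_simple:replace(tab1, <<"foo">>, <<"new val">>, 0, [{testset, TS}], 5000).

If another client updates the key between the get and the replace, the server refuses the replace with a {ts_error, NewerTimestamp} result instead of ok.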

It is recommended that all Hibari nodes use NTP to synchronize their system clocks. The simplest Hibari client API derives its timestamp values from the OS system clock. This behavior can be bypassed, however, by using a slightly more complex client API.

However, Hibari’s overload detection and work-dumping algorithms will use the OS system clock, regardless of which client API is used. All system clocks, client and server, should be synchronized to within roughly 1 second of each other.

High availability

Each key can be replicated multiple times (configurable on a per-table basis). As long as one copy of the key survives, all operations on that key are permitted. A cluster can survive multiple cluster node failures and still maintain full data integrity.

The cluster membership application, called the Hibari Admin Server, runs as an active/standby application on one or more of the server nodes. The Admin Server’s configuration and private state are also maintained in Hibari server nodes. Shared storage such as NFS, shared SCSI/Fibre Channel LUNs, or replicated block devices are not required.

If the Admin Server fails and is restarted on a standby node, the rest of the cluster can continue normal operation. If another brick fails while the Admin Server is restarting, then clients may see service interruptions (usually in the form of timeouts) until the Admin Server has finished restarting and can react to the failure.

Multiple Client Protocols

Hibari supports many client protocols for queries and updates:

  • A native Erlang API, via Erlang’s native message-passing mechanism
  • Amazon S3 protocol, via HTTP
  • UBF, Joe Armstrong’s “Universal Binary Format” protocol, via TCP
  • UBF via several minor variations of TCP transport
  • UBF over JSON-RPC, via HTTP
  • JSON-encoded UBF, via TCP

Protocols under development:

  • Memcached, via TCP
  • UBF over Thrift, via TCP
  • UBF over Protocol Buffers, via TCP

Most of the client access protocols are implemented using the Erlang/OTP application behavior. By separating each access protocol into separate OTP applications, Hibari’s packaging is quite flexible: packaging can add or remove protocol support as desired. Similarly, protocols can be stopped and started at runtime.

[[overview-high-performance]]

High performance

Hibari’s performance is competitive with other distributed, non-relational databases such as HBase and Cassandra, when used with similar replication and durability configurations. Despite the constraints of durable writes and strong consistency, Hibari’s performance can exceed those databases on some workloads.

IMPORTANT: The metadata of all keys stored by the brick, called the “key catalog”, is stored in RAM to accelerate commonly-used operations. In addition, non-zero values of the “expiration_time” and non-empty values of “flags” are also stored in RAM (see xref:sql-definition-hibari[]). As a consequence, a multi-million key brick can require many gigabytes of RAM.

Automatic repair

Replicas of keys are automatically repaired whenever a cluster node crashes and restarts.

Dynamic configuration

The number of replicas per key can be changed without service interruption. Likewise, replication chains can be added or removed from the cluster without service interruption. This permits the cluster to grow (or shrink) as workloads change. See xref:chain-migration[] for more details.

Data rebalancing

Keys will automatically be rebalanced across the cluster without service interruption. See xref:chain-migration[] for more details.

Heterogeneous hardware support

Each replication chain can be assigned a weighting factor that will increase or decrease the percentage of a table’s key space relative to all other chains. This feature can permit use of cluster nodes with different CPU, RAM, and/or disk capacities.

Micro-Transactions

Under limited circumstances, operations on multiple keys can be given transactional commit/abort semantics. Such micro-transactions can considerably simplify the creation of robust applications that keep data consistent despite failures by both clients and servers.
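
As a small sketch, using the brick_simple and brick_server helpers shown earlier in this document (the table and keys are placeholders, and both keys are assumed to reside on the same chain, as micro-transactions require):

%% Either both adds are applied, or neither is.
> brick_simple:do(tab1, [brick_server:make_txn(),
                         brick_server:make_add(<<"order-1">>, <<"100 USD">>),
                         brick_server:make_add(<<"order-1-status">>, <<"new">>)]).

Either both adds are applied, or the request fails as a whole (for example with a {txn_fail, ...} result if one of the keys already exists) and neither key is written.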

[[per-table-config-perf-options]]

Per-table configurable performance options

Each Hibari table may be configured with the following options to enhance performance ... though each of these options has a corresponding price to pay.

  • RAM-based storage: All data (both keys and values) may be stored in RAM, at the expense of increased RAM consumption. Disk is still used to log all updates, to protect against a catastrophic power failure.
  • Asynchronous writes: Use of the fsync() system call can be disabled to improve performance, at the expense of data loss in a system crash or power failure.
  • Non-durable updates: All update logging can be disabled to improve performance, at the expense of data loss when all nodes in a replication chain crash.
Building A Hibari Database

Defining a Schema

Hibari is a key-value database. Unlike a relational DBMS, Hibari applications do not need to create a schema. The only application requirement is that all its tables be created in advance, see xref:creating-new-tables[] below.

[[hibari-data-model]]

The Hibari Data Model

If a Hibari table were represented within an SQL database, it would look something like this:

[[sql-definition-hibari]]

.SQL-like definition of a Hibari table
include::texts-src/hibari-sql-definition.txt[]

Hibari table names use the Erlang data type “atom”. The types of all key-related attributes are presented below.

.Types of Hibari table key-value attributes
include::texts-src/hibari-key-value-attrs.txt[]

include::texts-src/hibari-key-value-attrs-expl.txt[]

The practical constraints on maximum value blob size are affected by total blob size and frequency of large blob access. For example, storing an occasional 64MB value blob is different from a 100% write workload of 100% 64MB value blobs. The Hibari client API does not have a method to update or fetch less than the entire value blob, so a brick can be blocked for many seconds if it tries to operate on (for example) even a single 4GB blob. In addition, other processes can be blocked by ‘busy_dist_port’ events while processing big value blobs.

Hibari’s Client Operations

Hibari’s basic client operations are enumerated below.

add:: Set a key/value/expiration/flags only if the key does not already exist.
delete:: Delete a key.
get:: Get a key’s timestamp and value.
get_many:: Get a range of keys.
replace:: Set a key/value/expiration/flags only if the key does exist.
set:: Set a key/value/expiration/flags.
txn:: Start of a micro-transaction.

Each operation can be accompanied by operation-specific flags. Some of these flags include:

witness:: Do not return the value blob. (get, get_many)
must_exist:: Abort micro-transaction if key does not exist.
must_not_exist:: Abort micro-transaction if key does exist.
{testset, TS}:: Perform the action only if the key’s current timestamp exactly matches TS. (delete, replace, set, micro-transaction)
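
For example, a witness read via the brick_simple API (the table and key are placeholders; argument form as in the brick_simple get references earlier in this document) returns only the key’s timestamp and never fetches the value blob:

%% Only the timestamp of <<"foo">> is returned; the value blob is not read.
> brick_simple:get(tab1, <<"foo">>, [witness], 5000).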

For details of these operations and lesser-used per-operation flags, see:

  • xref:micro-transactions[]
  • link:hibari-contributor-guide.en.html[Hibari Contributor’s Guide]

Indexes

Hibari does not support automatic indexing of value blobs. If an application requires indexing, the application must build and maintain those indexes.

[[creating-new-tables]]

Creating New Tables

New tables can be created by two different methods:

  • Via the Admin Server’s status server. Follow the “Add a table” link at the bottom.
  • Using the Erlang shell.

For details on the Erlang shell API and detailed explanations of the table options presented in the Admin server’s HTTP interface, see the link:hibari-contributor-guide.en.html[Hibari Contributor’s Guide]

Hibari Architecture

From a logical point of view, Hibari’s architecture has three layers:

  • Top layer: consistent hashing
  • Middle layer: chain replication
  • Bottom layer: the storage brick

This section discusses each of these major layers in detail, starting from the bottom and working upward.

.Logical architecture diagram; physical hosts/bricks are color-coded with 5 colors
svgimage::images/logical-architecture1[align="center", scaledwidth="80%"]

.Logical architecture diagram, alternative perspective
svgimage::images/logical-architecture-alt[align="center", scaledwidth="80%"]

Bricks, Physical and Logical

The word “brick” has two different meanings in a Hibari system:

  • An entire physical machine that has Hibari software installed, configured, and (hopefully) running.
  • A logical software entity that runs inside the Hibari application that is responsible for managing key-value pairs.

[[the-physical-brick]]
==== The physical brick

The phrases ``physical brick'' and ``machine'' are interchangeable, most of the time. Hibari is designed to react correctly to the failure of any part of the machine on which the Hibari application is running:

  • disk
  • power supply
  • CPU
  • network

Hibari is designed to take advantage of low-cost, off-the-shelf commodity servers.

A physical brick is the basic unit of failure. Data replication (via the chain replication algorithm) is responsible for protecting data, not redundant equipment such as dual power supplies and RAID disk subsystems. If a physical brick crashes for any reason, copies of data on other physical bricks can still be used.

It is certainly possible to decrease the chances of data loss by using physical bricks with more expensive equipment. Given the same number of copies of a key-value pair, the chances of data loss are less if each brick has multiple power supplies and RAID 1/5/6/10 disk. But risk of data loss can also be reduced by increasing the number of data replicas (“chain length”) using cheaper, non-redundant server hardware.

==== The logical brick

A logical brick is a software entity that runs within a Hibari application instance on a physical brick. A single Hibari physical brick can support dozens or (potentially) hundreds of logical bricks, though limitations of CPU, RAM, and/or disk capacity can impose a smaller limit.

A logical brick maintains RAM and disk data structures to store a collection of key-value pairs. The keys are maintained in lexicographic sorting order.

The replication technique used by Hibari, chain replication, maintains identical copies of key-value pairs across multiple logical bricks. The number of copies of a key-value pair is exactly equal to the length of the chain. See the next subsection below for more details.

It is possible to configure Hibari to place all of the logical bricks for the same chain onto the same physical brick. This practice can be useful in a developer’s environment, but it is impractical for production networks: such a configuration does not have any physical redundancy, and therefore it poses a greater risk of data loss.

[[write-ahead-logs]]
=== Write-Ahead Logs

By default, all logical bricks will record all updates to a write-ahead log. Used by many database systems, a write-ahead log (WAL) appears to be an infinitely-sized log where all important events (e.g. all write and delete operations) are appended to the end of the log. The log is considered write-ahead if a log entry is written prior to any significant processing by the application.

[[write-ahead-logs-in-hibari]]
==== Write-ahead logs in the Hibari application

Two types of write-ahead logs are used by the Hibari application. These logs cooperate with each other to provide several benefits to the logical brick.

There are two types of write-ahead logs:

  • The shared common log. This single write-ahead log instance provides durability guarantees to all logical bricks within the server node via the fsync() system call.
  • Individual private logs. Each logical brick maintains its own private write-ahead log instance. All metadata regarding keys in the logical brick are stored in the logical brick’s private log.

All updates are written first to the common log, usually in a synchronous manner. At a later time, update metadata is lazily copied from the common log to the corresponding brick’s private log. Value blobs (for bricks that store value blobs on disk) will remain in the common log and are managed by the scavenger, see xref:scavenger[].

svgimage::images/private-and-common-logs[align="center", scaledwidth="80%"]

[[two-wal-types]]
==== Two types of write-ahead logs

The two log types cooperate to support a number of useful properties.

  • Data durability in case of system crash or power failure. All synchronous writes to the ``common log'' are guaranteed to be flushed to stable storage.
  • Performance enhancement by limiting fsync() usage. After a logical brick writes data to the common log, it will request an fsync(). The common log will combine fsync() requests from multiple bricks into a single system call.
  • Performance enhancement at logical brick startup. A brick's private log stores only that brick's key metadata. Therefore, at startup time, the logical brick does not scan data maintained by other logical bricks. This can be a very substantial time savings as the amount of metadata managed by all logical bricks grows over time.
  • Performance enhancement by separating synchronous writes from asynchronous writes. If the common log's storage is on a separate device, e.g. a write-optimized flash memory block device, then all of the fsync() calls can finish much faster. The later asynchronous/lazy copying of key metadata from the common log to the individual private logs can take advantage of OS dirty page write coalescing and other I/O optimizations without interference from fsync(). These copies are performed roughly once per second.

[[wal-dirs-and-files]]
==== Directories and files used by write-ahead logs

Each write-ahead log is stored on disk as a collection of large files (default = 100MB each). Each file in the log is identified by a log sequence number and is called a log sequence file.

Log sequence files are append-only and, once closed, are never written to again. Consequently, data in a log sequence file is never overwritten. Disk space reclaimed by checkpoint and scavenger operations is recovered by copying data out of old log sequence files and appending it to new log sequence files. Once the new log sequence file(s) have been flushed to stable storage, the old log sequence file(s) can be deleted.

When a log sequence file reaches its maximum size, the current log file is closed and a new one is opened with the next (monotonically increasing) log sequence number.

All log files for a write-ahead log are grouped under a single directory called hlog.{log-name}, where {log-name} is the name of the brick or of the common log. These directories are stored under the var/data subdirectory of the application's installation path, /usr/local/TODO/TODO/var/data by default.

The maximum log file size (brick_max_log_size_mb in the central.conf file) is advisory only and is not enforced as a hard limit.

==== Reclaiming disk space used by write-ahead logs

In practice, infinite storage is not yet available. The Hibari system uses two mechanisms to reclaim unused disk space:

  • The checkpoint mechanism, see xref:checkpoints[].
  • The scavenger mechanism, see xref:scavenger[].

==== Write-ahead log serial numbers

Each item written in a write-ahead log is assigned a serial number. If the brick is in the standalone or head role, then the serial number is assigned by that brick. For downstream bricks, the serial number assigned by the head brick is used.

The serial number mechanism is used to ensure that a single unique ordering of log items will be written to each brick log. In certain failure cases, log items may be re-sent down the chain a second time, see xref:failure-middle-brick[].

// JWN: Does the above mechanism “to ensure that a single unique ordering” // applies to both common log and private log?

[[chains]] === Chains

A chain is the unit of data replication used by the link:http://www.usenix.org/events/osdi04/tech/renesse.html[``chain replication'' technique as described in this paper]:

    Chain Replication for Supporting High Throughput and Availability
    Robbert van Renesse and Fred B. Schneider
    USENIX OSDI 2004 conference proceedings
    http://www.usenix.org/events/osdi04/tech/renesse.html

Data replication algorithms can be separated into two basic families:

  • State machine replication
  • Quorum replication

The chain replication algorithm is from the state machine family of replication algorithms. It is a variation of the familiar ``master/slave’’ replication algorithm, where all updates are sent to a master node and then copies are sent to zero or more slave nodes.

Chain replication requires a very specific ordering of nodes (which store copies of data) and the messages passed between them. The diagram below depicts the “key update” message flow in a chain of length three.

[[diagram-write-path-3]]
.Message flow in a chain for a key update
svgimage::images/write-path-3[align="center", scaledwidth="80%"]

If a chain is of length one, then the same brick assumes both ``head’’ and ``tail’’ roles simultaneously. In this case, the brick is called a ``standalone’’ brick.

.Message flow for a key update to a chain of length 1
svgimage::images/write-path-1[align="center", scaledwidth="30%"]

To maintain strong consistency, a client must read data from the tail brick in the chain. A read processed by any other chain member could return an update that has not yet been processed by all bricks and could therefore violate strong consistency. Such a violation is frequently called a ``dirty read'' in other database systems.

.Message flow for a read-only key query
svgimage::images/read-path-3[align="center", scaledwidth="80%"]

[[bricks-outside-chain-replication]] ==== Bricks outside of chain replication

During Hibari’s development, we encountered a problem of managing the state required by the Admin Server. If data managed by chain replication requires the Admin Server to be running, how can the Admin Server read its own data? There is a ``chicken and the egg’’ dependency problem that must be solved.

// JWN: Why wasn’t Mnesia used for the Admin Server’s storage // implementation?

The solution is simple: do not use chain replication to manage the Admin Server’s data. Instead, that data is replicated using a simple ``quorum replication’’ technique. These bricks all have names starting with the string “bootstrap”.

A brick must be in ``standalone’’ mode to answer queries when it is used outside of chain replication. See xref:brick-roles[] for details on the standalone role.

=== Tables

A table divides the key namespace within Hibari. If you need two different keys called "foo" that have different values, you store each "foo" key in a separate table. The same is true in other database systems.

Hibari’s implementation uses one or more replication chains to store the data for one table.

.Relationship between tables, chains, and bricks.
svgimage::images/table-chain-brick[align="center", scaledwidth="70%"]

[[micro-transactions]] === Micro-Transactions

In a single request, a Hibari client may send multiple update operations to the cluster. The client has the option of requesting ``micro-transaction’’ semantics for those updates: if there are no errors, then all updates will be applied atomically. This behaves like the ``transaction commit’’ behavior supported by most relational databases.

On the other hand, if there is an error while processing one of the update operations, then all of the update operations will fail. This behaves like the ``transaction abort'' behavior supported by most relational databases.

Unlike most relational databases, Hibari does not have a transaction manager that can coordinate ACID semantics for arbitrary read and write operations across any row in any table. In fact, Hibari has no transaction manager at all. For this reason, Hibari calls its limited transaction feature ``micro-transactions’‘, to distinguish this feature from other database systems.

Hibari’s micro-transaction support has two important limitations:

  • All keys involved in the transaction must be stored in the same replication chain (and therefore by the same brick(s)).
  • Operations within the micro-transaction cannot see updates by other operations within the same micro-transaction.

[id="footab-example"]
.Four keys in the "footab" table, distributed across two chains of length three.
svgimage::images/micro-transaction-example[align="center", scaledwidth="70%"]

In the diagram above, a micro-transaction can be permitted if it operates on only the keys “string1” & “string4” or only the keys “string2” and “string3”. If a client were to send a micro-transaction that operates on keys “string1” and “string3”, the micro-transaction will be rejected: key “string3” is not stored by the same chain as the key “string1”.

[id="valid-utxn"]
.Valid micro-transaction: all keys managed by same chain
----
[txn,
 {op = replace, key = "string1", value = "Hello, world!"},
 {op = delete, key = "string4"}
]
----

[id="invalid-utxn"]
.Invalid micro-transaction: keys managed by different chains
----
[txn,
 {op = replace, key = "string1", value = "Hello, world!"},
 {op = delete, key = "string2"}
]
----

The client does not have direct control over how keys are distributed across chains. When a table is defined and created, its configuration specifies the algorithm used to map a {TableName, Key} pair to a specific chain.

// JWN: This might be a good place to briefly explain the benefits of // using a key prefix and how it is beneficial to (some) applications.

NOTE: See link:hibari-contributor-guide.en.html#add-a-new-table[Hibari Contributor’s Guide, “Add a New Table” section] for more information about table configuration.

=== Distribution: Workload Partitioning and Fault Tolerance

[[consistent-hashing-example]] ==== Partitioning by consistent hashing

To spread computation and storage workloads across all servers in the cluster, Hibari uses a technique called ``consistent hashing’‘. This hashing technique attempts to distribute a table’s key space evenly across all chains used by that table.

IMPORTANT: The word ``consistent’’ has slightly different meanings relative to ``consistent hashing’’ and ``strong consistency’‘. The consistent hashing algorithm is a commonly-used algorithm for key -> storage location calculations. Consistent hashing does not affect the ``eventual consistency’’ or ``strong consistency’’ semantics of a database system.

See the xref:footab-example[] for an example of a table with two chains.

See link:hibari-contributor-guide.en.html#add-a-new-table[Hibari Contributor’s Guide, “Add a New Table” section] for details on valid options when creating new tables.

===== Consistent hashing algorithm

Hibari uses the following steps in its consistent hashing algorithm implementation:

  • Calculate the ``hashing prefix'', using part or all of the key as input to the next step.
** This step is configurable, using built-in functions or by providing a custom implementation function.
** Built-in prefix functions:
*** Null: use the entire key.
*** Fixed length, e.g. a 4 byte or 8 byte constant-length prefix.
*** Variable length: use a separator character '/' (configurable) such that the hash prefix is found between the first two (also configurable) '/' characters. E.g. if the key is /user/bar, then the string /user/ is used as the hash prefix.
  • Calculate the MD5 checksum of the hashing prefix and then convert the result to the unit interval, 0.0 - 1.0, using floating point arithmetic.
  • Consult the unit interval -> chain map to calculate the chain name.
** This map contains a tree of {StartValue, EndValue, ChainName} tuples. For example, {0.0, 0.5, footab_ch1} will map the interval (0.0, 0.5] to the chain named footab_ch1.
** The mapping tree's construction is affected by the chain weighting factor. The weighting factor allows some chains to store more data than other chains.
  • Use the operation type to calculate the brick name.
** For read-only operations, choose the tail brick.
** For update operations, choose the head brick.
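
The following is a minimal, self-contained Erlang sketch of these steps, for illustration only. The module and function names are hypothetical, the variable-length prefix handling is simplified, and in a real cluster the {StartValue, EndValue, ChainName} map is built by the Admin Server from the table's configuration.

[source,erlang]
----
-module(ch_sketch).
-export([chain_for_key/2]).

%% Map a key to a chain name, given a list of {Start, End, ChainName}
%% intervals covering (0.0, 1.0].
chain_for_key(Key, IntervalMap) ->
    Prefix = hash_prefix(Key),
    Point  = unit_interval(Prefix),
    find_chain(Point, IntervalMap).

%% Simplified variable-length prefix: keep everything up to and including
%% the second '/'. E.g. <<"/user/bar">> -> <<"/user/">>.
hash_prefix(Key) when is_binary(Key) ->
    case binary:split(Key, <<"/">>, [global]) of
        [<<>>, First | _Rest] -> <<"/", First/binary, "/">>;
        _                     -> Key      % fall back to using the whole key
    end.

%% MD5 of the prefix, converted to a float in [0.0, 1.0).
unit_interval(Prefix) ->
    <<N:128/unsigned>> = erlang:md5(Prefix),
    N / math:pow(2, 128).

%% Walk the {Start, End, ChainName} list; the point falls in (Start, End].
find_chain(Point, [{Start, End, Chain} | _]) when Point > Start, Point =< End ->
    Chain;
find_chain(Point, [_ | Rest]) ->
    find_chain(Point, Rest).
----

For example, chain_for_key(<<"/user/bar">>, [{0.0, 0.5, footab_ch1}, {0.5, 1.0, footab_ch2}]) computes the hash prefix <<"/user/">>, maps it to a point on the unit interval, and returns whichever chain owns that point.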

===== Consistent hashing algorithm use within the cluster

  • Hibari clients use the algorithm to calculate which chain must handle operations for a key. Clients obtain this information via updates from the Hibari Admin Server. These updates allow the client to send its request directly to the correct server in most use cases.
  • Servers use the algorithm to verify that the client's calculation was correct.
** If a client sends an operation to the wrong brick, the brick will forward the operation to the correct brick.
** If a client sends a list of operations such that some keys are stored on the brick and other keys are not, an error is returned to the client. Micro-transactions are not supported across chains.

===== Changing consistent hashing configuration dynamically

Hibari’s Admin Server will allow changes to the consistent hashing algorithm without service interruption. Such changes are applied on a per-table basis:

  • Adding or removing chains to the unit interval -> chain map.
  • Modifications of the chain weighting factor.
  • Modifying the key -> hashing prefix calculation function.

See the xref:chain-migration[] section for more information.

==== Multiple replicas for fault tolerance

For fault tolerance, data replication is required. As explained in xref:chains[], the basic unit of failure is the brick. The chain replication algorithm will maintain replicas of keys in a strongly consistent manner across all bricks: head, middle, and tail bricks.

To be able to tolerate F failures without data loss or service interruption, each replication chain must be at least F+1 bricks long. This is in contrast to quorum replication family algorithms, which typically require 2F+1 replica bricks. For example, to survive the failure of two bricks, a chain must be at least three bricks long, whereas a quorum-based system typically requires five replicas.

// JWN: Would it be helpful to put a note that typically “3” is the // recommended number of replicas?

===== Changing chain length configuration dynamically

Hibari’s Admin Server will allow changes to a chain’s length without service interruption. Such changes are applied on a per-chain basis. See the xref:chain-length-change[] section for more information.

[[admin-server-app]]
== The Admin Server Application

The Hibari ``Admin Server’’ is an OTP application that runs in an active/standby configuration within a Hibari cluster. The Admin Server is responsible for:

  • Monitoring the health of each brick in the cluster, see xref:brick-lifecycle-fsm[].
  • Monitoring the status of each chain in the cluster, see xref:chain-lifecycle-fsm[].
  • Managing administrative changes of chain -> brick mappings, see xref:chain-length-change[].
  • Managing data rebalancing, see xref:chain-migration[].
  • Communicating cluster status to Hibari client nodes.
  • Other administrative tasks, such as the creation of new tables.

Only one instance of the Admin Server is permitted to run within the cluster at a time. The Admin Server runs in an ``active/standby’’ configuration that is used in many high-availability clustered applications. The nodes that are eligible to participate in the active/standby configuration are configured via the main Hibari configuration file; see xref:admin-server-in-central-conf[] and xref:central-conf-parameters[] for more details.

=== Admin Server Active/Standby Implementation

The active/standby application failover is handled by the Erlang/OTP application controller. No extra third-party software is required. See Chapter 7, “Applications”, and Chapter 9, “Distributed Applications”, in the “OTP Design Principles User’s Guide” at http://www.erlang.org/doc/design_principles/distributed_applications.html.

[[bootstrap-bricks]] === Admin Server’s Private State: the Bootstrap Bricks

On each active and standby node, there is a hint file called Schema.local which contains the names of the ``bootstrap bricks''. These bricks operate outside of the chain replication algorithm to provide redundant, persistent state for the Admin Server application. See xref:bricks-outside-chain-replication[] for a short summary of standalone bricks.

All of the Admin Server’s private state is stored in the bootstrap bricks. This includes:

  • All table definitions and their configuration, e.g. consistent hashing parameters.
  • Status of all bricks and all chains.
  • Operational history of all bricks and all chains.

With the help of the Erlang/OTP application controller and the Hibari Partition Detector application, only a single instance of the Admin Server is permitted to run at any one time. That single application instance has full control over the data stored in the bootstrap bricks and therefore does not have to manage concurrent updates to bootstrap brick data.

=== Admin Server Crash and Restart

When the Admin Server application is stopped (e.g. node shutdown) or crashes (e.g. software bug, power failure), all of the tasks outlined at the beginning of xref:admin-server-app[] are halted. In theory, the 20-30 seconds that are required for the Admin Server to restart could mean 20-30 seconds of negative service impact to Hibari clients.

In practice, however, Hibari clients almost never notice when an Admin Server instance crashes and restarts. Hibari clients do not need the Admin Server when the cluster is stable. The Admin Server is only necessary when the state of the cluster changes. Furthermore, as far as clients are concerned, they are only affected when bricks crash. Other cluster change events, such as when a chain replication repair finishes, do not directly impact clients and thus can wait for the Admin Server to finish restarting.

A Hibari client will only notice an Admin Server crash if another logical brick crashes while the Admin Server is temporarily out of service. The reason lies in the nature of the Admin Server's responsibilities. When a chain is broken by a brick failure, the remaining bricks must have their roles reconfigured to put the chain back into full service. The Admin Server is the only automated entity that is permitted to change the role of a brick. For more details, see:

  • xref:brick-lifecycle-fsm[]
  • xref:chain-lifecycle-fsm[], and
  • xref:chain-repair[].

[[admin-server-and-network-partition]] === Admin Server and Network Partition

One limitation of the Erlang/OTP application controller is that it is not robust in the event of a network partition. To prevent multiple Admin Server apps from running simultaneously, another application is bundled with Hibari: the Partition Detector. See xref:partition-detector[] for an overview and an explanation of the 'A' and 'B' physical networks.

As described briefly in xref:cap-theorem-and-hibari[], Hibari does support the “Partition tolerance” aspect of Eric Brewer’s CAP theorem. More specifically, if a network partition occurs, and a Hibari cluster is split into two or more pieces, not all clients on both/all sides of the network partition will be able to access Hibari services.

For the sake of discussion, we assume the cluster has been split into two fragments by a single partition, though any number of fragments may happen in real use. We also assume that nodes on both sides of the partition are configured in standby roles for the Admin Server.

If a network partition event happens, the following events will soon follow:

  • The OTP application controller for some/all central.conf-configured nodes will notice that communication with the formerly active Admin Server is now impossible.
  • Using internal logic, each application controller will make a decision of which standby node should move to active status.
  • Each active status node will start an instance of the Admin Server.

Note that all steps above will happen in parallel on nodes on both sides of the partition. If this situation is permitted to continue, the invariant of “Admin Server may only run on one node at a time” will be violated. However, with the help of the Partition Detector application, multiple Admin Server instances can be detected and halted.

UDP broadcasts on the ‘A’ and ‘B’ networks can help the Admin Server determine if it was restarted due to an Admin Server crash or by a network partition. In case of a network partition on network ‘A’, the broadcasts on network ‘B’ can indicate that another Admin Server process remains alive.

If multiple Admin Server instances are detected, the following logic is used:

  • If an Admin Server instance is in its "running" phase, then any other Admin Server instance that is still in its "initialization" phase will halt.
  • If multiple Admin Server instances are all in the “initialization” phase, then only the Admin Server instance with the smallest name (in lexicographic sorting order) is permitted to run: all other instances will halt.

==== Importance of two physically separate networks

IMPORTANT: It is possible for both the 'A' and 'B' networks to partition simultaneously. The Admin Server and Partition Detector applications cannot always correctly react to such events. It is extremely important that the 'A' and 'B' networks be separate physical networks: separate physical network interfaces on each brick, separate cabling, separate network switches, and physically separate instances of all other network-related equipment.

It is possible to reduce the reliance on multiple physical networks and the Partition Detector application, but such techniques have not been added to Hibari yet. Until an alternative network partition mitigation mechanism is implemented, we strongly recommend the proper configuration of the Partition Detector app and all of its hardware requirements.

=== Admin Server, Network Partition, and Client Access

When a network partition event occurs, there are two cases that affect a client’s ability to work with the cluster.

  • The client machine is on the same side of the partition as the Admin Server.
  • The client machine is on the opposite side of the partition as the Admin Server.

If the client machine is on the same side of the partition, the client may see no interruption of service at all. If the Admin Server is restarted in reaction to the partition event, there may be a small window of time (e.g. 20-30 seconds) where requests might fail because the Admin Server has not yet reconfigured chains on this side of the partition.

If the client machine is on the opposite side of the partition, then the client will not have access to the Admin Server and may not have access to properly configured chains. If a chain lies entirely on the same side of the partition as the client, then the client can continue to use that chain successfully. However, any chain that is "cut in two" by the partition cannot support updates by any client.

== Hibari System Information: Configuration Files, Etc.

Hibari’s system information is stored in one of two places. The first is the application configuration file, central.conf. By default, this file is stored in TODO/{version number}/etc/central.conf.

The second location is within Hibari server nodes themselves. This kind of configuration, stored inside the “bootstrap” bricks, makes it easy to share data with all nodes in the cluster.

Many of the configuration values in central.conf will be the same on all nodes in a Hibari cluster. Given this reality, why not store those items in Hibari itself? The biggest problem arises when the application is first starting. See xref:bricks-outside-chain-replication[] for an overview of why it isn't easy to store all configuration data inside Hibari itself.

In the future, it’s likely that many of the configuration items in the central.conf file will move to storage within Hibari itself.

=== central.conf File Syntax and Usage

Each line of the central.conf file has the form

parameter: value

where parameter is the name of the configuration option being set and value is the value that the configuration option is being set to.

Valid data types for configuration settings are INT (integer), STRING (string), and ATOM (one of a pre-defined set of option names, such as on or off). Apart from data type restrictions, no further valid range restrictions are enforced for central.conf parameters.

All time values in central.conf (such as delivery retry intervals or transaction timeouts) must be set as a number of seconds.

Blank lines and lines beginning with the pound sign (#) are ignored.
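
For illustration, a fragment of a central.conf file might look like the following. The parameter names appear elsewhere in this guide; the values are examples only, not recommendations, and the authoritative list and defaults are in the Hibari central.conf Configuration Guide.

----
# Advisory maximum size of each write-ahead log sequence file, in MB.
brick_max_log_size_mb: 100

# Private log size that triggers a brick checkpoint operation, in MB.
brick_check_checkpoint_max_mb: 50

# Bandwidth throttle, in bytes, shared by all concurrent checkpoint operations.
brick_check_checkpoint_throttle_bytes: 10000000

# Scavenger: skip log files whose live-data ratio exceeds this percentage,
# and throttle the scavenger's copying bandwidth, in bytes.
brick_skip_live_percentage_greater_than: 90
brick_scavenger_throttle_bytes: 5000000
----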

IMPORTANT: To apply changes that you have made to the central.conf file, you must restart the server. There are exceptions to this rule, but making all access to central.conf go through a standard set of APIs (e.g. always using the gmt_config_svr API) is one of the remaining cleanup/janitor tasks.

[[central-conf-parameters]] === Parameters in the central.conf File

A detailed explanation of each of the items in central.conf can be found at link:../misc-files/central-conf.pdf[Hibari central.conf Configuration Guide].

=== Admin Server Configuration

Configuration for the Hibari ``Admin Server’’ is stored in three places:

. The central.conf file
. The Schema.local file
. Inside the ``bootstrap'' bricks

[[admin-server-in-central-conf]] ==== Admin Server entries in the central.conf file

The following entries in the central.conf file are used by the Hibari Admin Server:

  • admin_server_distributed_nodes
** This option specifies which nodes in the Hibari cluster are
eligible to run the Admin Server. Hibari server nodes not included in this list cannot run the Admin Server.
** Active/standby service is provided by the Erlang/OTP platform’s
application management facility.
  • The Schema.local file
** This file provides a list of {logical brick, Hibari server node name}
tuples that store the Admin Server’s private state. Each brick name in this list starts with the prefix bootstrap_copy followed by an integer.
  • The ``bootstrap’’ bricks
** Each of these bricks store an independent copy of all Hibari
cluster state: table definitions, table -> chain mappings, start & stop history, etc.
** Data in each of the bootstrap bricks is not maintained by chain
replication. Rather, quorum-style replication is used. See xref:bricks-outside-chain-replication[].

=== Configuration Not Stored in Editable Config Files

All table and chain configuration parameters are stored within the Admin Server’s ``schema’‘. The schema contains information on:

  • Table names and options (e.g. blob values stored in RAM or on disk, sync/async disk logging)
  • Table -> chain mappings
  • Chain -> brick mappings

Much of this information can be seen in HTML form by pointing a Web browser at TCP port 23080 (default) of any Hibari server node. For example:

.Admin Server Top-Level Status & Admin URL
http://hibari-server-node-hostname:23080/

Your Web browser should be redirected automatically to the Admin Server’s top-level status & admin page.

NOTE: The APIs that expose this are, for the most part, already written. We need more “friendly” wrapper funcs as part of the “try this first” set of APIs for administration.

== The Life of a (Logical) Brick

All logical bricks within a Hibari cluster go through the same set of lifecycle events. Each is described in greater detail in this section.

  • Brick initialization and operation states, described by a finite state machine.
  • Brick roles within chain replication, also described by a finite state machine.
  • Periodic housekeeping tasks performed by logical bricks and their internal support services, e.g. checkpoints and the ``scavenger’‘.

[[brick-lifecycle-fsm]] === Brick Lifecycle Finite State Machine

The lifecycle of each Hibari logical brick goes through a set of states defined by a finite state machine (OTP gen_fsm behavior) that is executed by a process within the Admin Server application.

.Logical brick lifecycle finite state machine
svgimage::images/brick-fsm[align="center"]

.Logical brick lifecycle FSM states
unknown;;
This is the initial state of the FSM. Because the Admin Server may crash or be restarted at any time, this state is used by the Admin Server when it has not been running long enough to determine the state of the logical brick.
pre_init;;
A brick moves itself to this state when it has finished scanning its private write-ahead log (see xref:write-ahead-logs[]) and therefore knows the state of all keys that it manages.
repairing;;
In chain replication, the repairing state is used to synchronize a newly started or restarted brick with the rest of the chain. At the end of this state, the brick is 100% in sync with all other active members of the chain. Repair is initiated by the Admin Server's chain monitor that is responsible for the chain.
ok;;
The brick moves itself to this state when repair has finished. The brick is now in service and capable of servicing Hibari client requests. Client requests will be rejected if the brick is in any other state.
* If managed by chain replication, this brick is eligible to be put into service as a full member of a replication chain. See xref:brick-roles[].
* If managed by quorum replication, some external entity must change the logical brick's state from pre_init -> ok. Hibari's Admin Server automates this task for the `bootstrap_copy`* bricks. The present implementation of the Admin Server does not manage quorum replication bricks outside of the Admin Server's private use.
disk_error;;
A disk error has occurred, for example a missing file or directory, or an MD5 checksum error. Administrator intervention is required to move a brick out of the disk_error state: shut down the entire Hibari server, kill the logical brick manually, or use the brick_chainmon:force_best_first_brick() function manually.

[[chain-lifecycle-fsm]] === Chain Lifecycle Finite State Machine

The chain FSM (OTP gen_fsm behavior) is executed by a process within the Admin Server application. All state transitions are triggered by a change in a member brick's state, into or out of the 'ok' state. See xref:brick-lifecycle-fsm[] for details.

.Chain replication finite state machine
svgimage::images/chain-fsm[align="center"]

.Chain lifecycle FSM states
unknown;;
The state of the chain is unknown. Information regarding chain members is unavailable. Because the Admin Server may crash or be restarted at any time, this state is used by the Admin Server when it has not been running long enough to determine the state of the chain. It is possible that the chain was in the degraded or healthy state before the crash, and therefore Hibari client operations may be serviced while in this state.
unknown_timeout;;
This intermediate state is used by the Admin Server before moving automatically to another state.
stopped;;
All bricks in the chain have crashed or are believed to have crashed. Service to Hibari clients will be interrupted.
degraded;;
Some (but not all) bricks in the chain are in service. The Admin Server will wait for another chain member to enter its pre_init state before chain repair can start.
healthy;;
All bricks in the chain are in service.

[[brick-roles]] === Brick ``Roles’’ Within A Chain

Each brick within a chain has a role. The role will be changed by the Admin Server whenever it detects that the chain’s state has changed. These roles are:

head;;
The brick is first in the chain, i.e. at the ``head’’ of the chain’s ordered list of bricks.
tail;;
The brick is last in the chain, i.e. at the ``tail’’ of the chain’s ordered list of bricks.
middle;;
The brick is neither the ``head’’ nor ``tail’’ of the chain. Instead, the brick is somewhere in the middle of the chain.
standalone;;
In a chain of length 1, the ``standalone’’ brick is a brick that acts both as a ``head’’ and ``tail’’ brick simultaneously.

There is one additional attribute that is given to a brick in a cluster. Its name is ``official tail''.

official tail;;
The official tail brick has two duties for the chain:
* It handles read-only queries to the chain.
* It sends replies to the client for all update operations that are sent to the head of the chain.

When a brick at the end of the chain is undergoing repair, the official tail is the last fully in-service brick in the chain. Hibari clients are not aware of ``tail'' bricks that are undergoing repair. Any client request that is sent to a brick in the repairing state will be rejected.

See xref:diagram-write-path-3[] for an example of a healthy chain of length three.

[[brick-init]] === Brick Initialization

A logical brick does not maintain an on-disk data structure, such as a binary tree or B-tree, to keep track of the keys it stores. Instead, each logical brick maintains that metadata entirely in RAM. Therefore, the only time that the metadata in the private write-ahead log is ever read is at brick initialization time, i.e. when the brick restarts.

The contents of the private write-ahead log are used to repopulate the brick’s ``key catalog’‘, the list of all keys (and associated metadata) stored by the brick.

When a logical brick is started, all of the log sequence files in the private log are read, starting from the oldest and ending with the newest. (See xref:wal-dirs-and-files[].) The total amount of data required at startup can be quite small or it can be hundreds of gigabytes. The factors that influence the amount of data in the private log are:

  • The total number of keys stored by the logical brick.
** More keys means that the log sequence file created by a checkpoint
operation will be larger.
  • The size of the brick_check_checkpoint_max_mb configuration parameter in the central.conf config file.

When the log scan is complete, construction of the brick’s in-RAM key catalog is finished.

See xref:checkpoints[] for details on brick checkpoint operations.

[[chain-repair]] === Chain Repair

When a chain is in the degraded state, new bricks that have entered their pre_init state can become eligible to join the chain. All new bricks are added to the end of the chain and undergo the chain repair process.

.Chain of length 2 in degraded state, a third brick under repair
svgimage::images/read-write-path-3-repair[align="center", scaledwidth="80%"]

The protocol used between upstream and downstream bricks is an iterative protocol that has two phases in a single iteration.

  1. The upstream brick sends a subset of {Key, Timestamp} tuples downstream.
* The downstream brick deletes keys from its key catalog that do not appear in the upstream's subset.
* The downstream brick replies with the list of keys that it does not have or for which it has older timestamps.
  2. The upstream brick sends full information (all key metadata and value blobs) for all keys requested by the downstream brick in step #1.
* The downstream brick acknowledges the new/replacement keys.
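
Purely as an illustration of the bookkeeping involved (this is not Hibari's actual code; the module, function, and data shapes are hypothetical), one iteration of the first phase can be thought of as a function over lists of {Key, Timestamp} pairs:

[source,erlang]
----
-module(repair_sketch).
-export([repair_round/2]).

%% Illustrative only. UpSubset is the upstream brick's {Key, Timestamp} subset;
%% DownCatalog is the downstream brick's current {Key, Timestamp} catalog.
%% Returns the keys the downstream brick should delete and the keys it must
%% request in full (missing locally, or locally older than upstream).
repair_round(UpSubset, DownCatalog) ->
    UpKeys   = [K || {K, _TS} <- UpSubset],
    Deletes  = [K || {K, _TS} <- DownCatalog, not lists:member(K, UpKeys)],
    Requests = [K || {K, UpTS} <- UpSubset,
                     case lists:keyfind(K, 1, DownCatalog) of
                         false       -> true;           % missing downstream
                         {K, DownTS} -> DownTS < UpTS   % downstream copy older
                     end],
    {Deletes, Requests}.
----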

When the repair is finished, the Admin Server will change the roles of some/all chain members to make the repairing brick the new tail of the chain.

Only one brick may be repaired at one time. In theory it is possible to repair multiple bricks simultaneously, but the extra code complexity that would be required to do so has been judged to be too expensive (so far).

==== Chain reordering when moving from degraded -> healthy states

[[chain-reordering-middle-brick-fails]]
.Chain order after a middle brick fails and is repaired (but not yet reordered)
svgimage::images/chain-fail-repair-reorder[align="center", scaledwidth="70%"]

After a middle brick fails and is repaired, the chain's ordering is: brick 1 -> brick 3 -> brick 2. This is the final ordering expected by the algorithm in the original chain replication paper. The Hibari implementation adds another step: reordering the chain.

For chains longer than length 1, when the Admin Server moves the chain from degraded -> healthy state, the Admin Server will reorder the chain to match the schema's definition for the healthy chain order. The assumption is that the Hibari administrator wishes the chain to use a very specific order when it is in the healthy state. For example, if the chain's workload were extremely read-intensive, the machine for logical brick #3 could have a faster CPU or faster disks than the other bricks in the chain. To take full advantage of the extra capacity, the chain should be reordered as soon as possible.

However, it is not easy to reorder the chain. The replication of a client update during the reordering could get lost and violate Hibari’s strong consistency guarantees. The following algorithm is used to preserve consistency:

  1. Set all bricks to read-only mode.
  2. Wait for all updates to sync to disk at each brick and to progress downstream fully from head -> tail.
  3. Set brick roles to reflect the final desired order.

  4. Set all bricks to read-write mode.
** Client ``do'' operations that contain updates will be resubmitted (via the client-side API function brick_server:do()) to the cluster.

Typically, executing this algorithm takes less than one second. However, because the head brick is forced temporarily into read-only mode, client update requests will be delayed until read-only mode is turned off.

Client update requests submitted during read-only mode will be queued by the head brick and will be processed when read-only mode is turned off. Client read-only requests are not affected by read-only mode.

// JWN: I think it might be helpful to mention/ to explain (but maybe // not here) that Client updates may actually persist even though the // client stopped waiting and returned a timeout to the “application”. // A Timeout on Client updates can not guarantee the change was // applied or not applied to the Hibari tables.

[[checkpoints]] === Brick Checkpoint Operations

As updates are received by a brick, those updates are written to the brick's private write-ahead log. During normal operations, the private write-ahead log is write-only: the data there is only read at logical brick initialization time.

The checkpoint operation is used to reclaim disk space in the brick’s private write-ahead log. See xref:wal-dirs-and-files[] for a description of log sequence files and xref:central-conf-parameters[] for details on the central.conf configuration file.

.Brick checkpoint processing steps
  1. When the total log size (i.e. the total size of all log files in the brick's private log's short-term storage area) reaches the size of the brick_check_checkpoint_max_mb parameter in central.conf, a checkpoint operation is started.
* Assume that the current log sequence file number is N.
  2. Two log sequence files are created, N+1 and N+2.
  3. Checkpoint data is written to log sequence number N+1.
  4. New updates by clients and chain replication are written to log sequence number N+2.
  5. Contents of the brick's in-RAM key catalog are dumped to log sequence file N+1, subject to the bandwidth constraint of the brick_check_checkpoint_throttle_bytes configuration parameter.
  6. When the checkpoint is finished and flushed to disk, all log sequence files with a number less than or equal to N are deleted.

IMPORTANT: Each logical brick will checkpoint itself as its private log grows. It is possible that multiple logical bricks can schedule checkpoint operations simultaneously. The bandwidth limitation of the brick_check_checkpoint_throttle_bytes parameter is applied to the _sum of all writes by all checkpoint operations_.

[[scavenger]] === The Scavenger

As described in xref:write-ahead-logs[], all updates from all logical bricks are first written to the ``common log’‘. The most common of these updates are:

  • Metadata updates, e.g. key insert or key delete, by a logical brick.
  • A new value blob associated with a metadata update, such as a Hibari client set operation.
** This type is only applicable if the brick is configured to store value blobs on disk. This configuration is defined (by default) on a per-table basis and is then propagated to the chain and brick level by the Admin Server.

As explained in xref:write-ahead-logs[], the write-ahead log provides infinite storage at a logical level. But at the physical level, disk space must be reclaimed somehow. Because the common log is shared by multiple logical bricks, the technique described in xref:checkpoints[] cannot be used by the common log.

A process called the ``scavenger’’ is used to reclaim disk space in the common log. By default, the scavenger runs at 03:00 daily. The steps it executes are described below.

.Common log scavenger processing steps
  1. For all bricks that store value blobs on disk, scan each logical brick's in-RAM key catalog to create a list of all value blob storage locations.
  2. Sort the value blob location list by log sequence number.
  3. Identify all log sequence files with a ``live data ratio'' of at least X percent (default = 90%; see the brick_skip_live_percentage_greater_than configuration parameter). These files are skipped by the scavenger.
  4. For all log files where the live data ratio is less than X%, copy the live value blobs to new log sequence files. This copying is limited by the amount of bandwidth configured by brick_scavenger_throttle_bytes in central.conf.
  5. When all blobs have been copied out of an old log sequence file and flushed to stable storage, update the storage locations in the in-RAM key catalog, then delete the old log sequence file.

ifdef::theme[]
image:images/scavenger-techpubs.png[]
endif::theme[]
ifndef::theme[]
image:images/scavenger-techpubs.png[width="65%"]
endif::theme[]

IMPORTANT: The value of the brick_skip_live_percentage_greater_than configuration parameter determines how much additional disk space is required to store X gigabytes of live data. If the parameter is N, then up to 100-N percent of all common log disk space may be wasted by storing dead data. For example, with the default value of 90, up to 10% of the common log's disk space may hold dead data.

IMPORTANT: Additional disk space is required to log all updates that are made after the scavenger has run. This includes space in the common log as well as in each logical brick's private log (subject to the general limit of the brick_check_checkpoint_max_mb configuration parameter).

IMPORTANT: The current implementation of Hibari requires that plenty of disk space _always_ be available for write-ahead logs and for scavenger operations. We strongly recommend that the brick_scavenger_temp_dir configuration item use a different file system than the application_data_dir parameter. This directory stores temporary files required for sorting and other operations that would otherwise require large amounts of RAM.

== Dynamic Cluster Reconfiguration

[[add-table]] === Adding a Table

A table can be added at any time, using either of two methods:

  • Use the Admin Server’s HTTP service: follow the “Add a table” hyperlink at the bottom of the top-level page.
  • Use the brick_admin CLI interface at the Erlang shell. See link:hibari-contributor-guide.en.html#add-a-new-table[Hibari Contributor's Guide, "Add a New Table" section].
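
As an illustrative sketch of the Erlang shell method (the brick_admin:add_table/2 call shown here, and the brick and node names, are assumptions; the Contributor's Guide documents the authoritative functions and table options):

[source,erlang]
----
%% Illustrative only: create a table "footab" backed by one chain of length 3.
> ChainList = [{footab_ch1, [{footab1_ch1_b1, 'gdss1@box-a'},
                             {footab1_ch1_b2, 'gdss1@box-b'},
                             {footab1_ch1_b3, 'gdss1@box-c'}]}].
> brick_admin:add_table(footab, ChainList).
----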

[[remove-table]] === Removing a Table

NOTE: The current Hibari implementation does not support removing a table.

In theory, most of the work of removing a table is already done. When chains are abandoned after a migration, they are shut down:

  • Brick pinger processes are stopped.
  • Chain monitor processes are stopped.
  • Bricks are stopped.
  • Brick data directories are removed.

All that remains is to update the Admin Server’s schema to remove references to the table.

[[chain-length-change]] === Changing Chain Length (Changing Replication Factor)

The Hibari Admin Server manages each chain as an independent data replication entity. Though Hibari clients view multiple chains that are associated with a single table, each chain is actually independent of the other chains. It is possible to change the length of one chain without changing any others. For long term operation, such differences do not make sense. But during short periods of cluster reconfiguration, such differences are possible.

A chain’s length is determined by specifying a list of bricks that are members of that chain. The order of the list specifies the exact chain order when the chain is in the healthy state. By adding or removing bricks from a chain definition, the length of the chain can be changed.

A chain is defined by the Erlang 2-tuple of {ChainName, ListOfBricks}, where each brick in ListOfBricks is a 2-tuple {BrickName, NodeName}. For example, a chain of length two called footab_ch1 could be defined as:

{footab_ch1, [{footab1_ch1_b1, 'gdss1@box-a'}, {footab1_ch1_b2, 'gdss1@box-b'}]}

The current definition of all chains for table TableName can be retrieved from the Admin Server using the brick_admin:get_table_chain_list() function, for example:

----
%% Get a list of all tables currently defined.
> brick_admin:get_tables().
[tab1]

%% Get list of chains in 'tab1' as they are currently in operation.
> brick_admin:get_table_chain_list(tab1).
{ok,[{tab1_ch1,[{tab1_ch1_b1,'gdss1@machine-1'},
                {tab1_ch1_b2,'gdss1@machine-2'}]},
     {tab1_ch2,[{tab1_ch2_b1,'gdss1@machine-2'},
                {tab1_ch2_b2,'gdss1@machine-1'}]}]}
----

The above chain list for table tab1 corresponds to the chain and brick layout below.

.Table tab1: Two chains of length two across two Erlang nodes on two physical machines
svgimage::images/tab1-2x2[align="center", scaledwidth="70%"]

NOTE: To change the definition of a chain, use the change_chain_length/2 or change_chain_length/3 functions. For documentation, see link:hibari-contributor-guide.en.html#changing-chain-length[Hibari Contributor’s Guide, “Changing Chain Length” section]

NOTE: When specifying a new chain definition, at least one brick from the current chain must be included.
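
For illustration (assuming the change_chain_length/2 form takes the chain name and the new brick list; the third brick and its node name below are hypothetical), extending tab1_ch1 from length two to length three might look like this:

[source,erlang]
----
%% Illustrative only: extend chain tab1_ch1 from length two to length three.
%% At least one brick from the current chain (here tab1_ch1_b1) is kept.
> brick_admin:change_chain_length(tab1_ch1,
      [{tab1_ch1_b1, 'gdss1@machine-1'},
       {tab1_ch1_b2, 'gdss1@machine-2'},
       {tab1_ch1_b3, 'gdss1@machine-3'}]).
----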

// JWN: Is it dangerous to allow an admin the opportunity to NOT SPECIFY the head // of the chain in the new chain definition or to SPECIFY only a brick // that is under repair? I guess I see an opportunity for some // “dynamic” (and not just static) pre-conditions that should/could be // checked FIRST before starting to execute the changes.

[[chain-change-same-algorithm]] ==== Chain changes: same algorithm, different tasks.

The same brick repair technique is used to handle all three of the following cases:

  • adding a brick to a chain
  • brick failure
  • removing a brick from a chain

==== Adding a brick to a chain

When a brick B is added to a chain, that brick is treated as if it was a member of the chain that had crashed long ago and has now been restarted. The same repair algorithm is used to synchronize data on brick B that is used to repair bricks that were formerly in service but since crashed and restarted. See xref:chain-repair[] for a description of the Hibari repair mechanism.

==== Brick failure

If a brick fails, the Admin Server must remove it from the chain by reordering the chain. The general order of operations is:

  1. Set new roles for the chain’s bricks, starting from the end of the chain and working backward.
  2. Broadcast the new chain membership to all Hibari clients.

If a Hibari client attempts to send an operation to a brick before the new chain info from step #2 has arrived, that client may send the operation to the wrong brick. Hibari servers will automatically forward the query to the correct brick. Due to network latencies and asynchronous message passing, the query may be forwarded multiple times before it arrives at the correct brick.

Specific details of how chain replication handles brick failure can be found in van Renesse and Schneider’s paper, see xref:chains[] for citation details.

===== Failure of a head brick

If the head brick fails, then the first middle brick is promoted to the head role. If there is no middle brick (i.e. the length of the chain was two), then the tail brick is promoted to a standalone role (chain length is one).

===== Failure of a tail brick

If the tail brick fails, then the last middle brick is promoted to the tail role. If there is no middle brick (i.e. the length of the chain was two), then the head brick is promoted to a standalone role (chain length is one).

[[failure-middle-brick]] ===== Failure of a middle brick

The failure of a middle brick requires the most complex recovery procedure.

  • Assume that the chain is three bricks: A -> B -> C.
** If the chain is longer (more bricks upstream of A and/or more
bricks downstream of C), the procedure remains the same.
  • Brick C is configured to have its upstream brick be A.
  • Brick A is configured to have its downstream brick be C.
  • The head of the chain (brick A or the head brick upstream of A) requests a log flush of all unacknowledged writes downstream. This step is required to re-send updates that were processed by A but have not been received by C because of middle brick B‘s failure.
  • Brick A waits until it receives a write acknowledgment from the tail of the chain. Once received, all bricks in the chain have synchronously written all items to their write-ahead logs in the correct order.

==== Removing a brick from a chain

Removing a brick B permanently from a chain is a simple operation. Brick B is handled the same way that any other brick failure is handled: the chain is simply reconfigured to exclude B. See xref:chain-reordering-middle-brick-fails[] for an example.

IMPORTANT: When a brick B is removed from a chain, all data from brick B will be deleted when the operation is successful. At this time, the API does not have an option to allow B‘s data to be preserved.

// JWN: Wah ... a typo could be very dangerous. Delayed deletion of // the data and/or some other protective mechanism could be helpful.

[[chain-migration]] === Chain Migration: Rebalancing Data Across Chains

There are several cases where it is desirable to rebalance data across chains and bricks in a Hibari cluster:

  • Chains are added or removed from the cluster
  • Brick hardware is changed, e.g. adding extra disk or RAM capacity
  • A change in a table’s consistent hashing algorithm configuration forces data (by definition) to another chain.

The same technique is used in all of these cases: chain migration. This mirrors the same design philosophy that’s used for handling chain changes (see xref:chain-change-same-algorithm[]): use the same algorithm to handle multiple use cases.

==== Example: Migrating from three chains to four

[[chain-migration-3to4]]
.Chain migration from 3 chains to 4 chains
svgimage::images/chain-migration-3to4[align="center", scaledwidth="80%"]

In the example above, both the 3-chain and 4-chain configurations used equal weighting factors. When all chains use the same weighting factor (e.g. 100), then the consistent hashing map in the ``before’’ and ``after’’ cases look something like the figure below.

[[migration-3to4]]
.Migration from three chains to four chains
svgimage::images/migration-3to4[align="center", scaledwidth="70%"]

It doesn't matter that chain #4's total area within the unit interval is divided into three regions. What matters is that chain #4's total area is equal to the area assigned to each of the other three chains.

==== Example: Migrating from three chains to four with unequal weighting

The diagram xref:migration-3to4[] demonstrates how a migration would work when all chains have an equal weighting factor, e.g. 100. If instead, the new chain had a weighting factor of only 50, then the distribution of keys to each chain would look like this:

.Migration from three chains to four with unequal chain weighting factors
[options="header"]
|=========
| Chain Name | Total % of keys before/after migration | Total unit interval size before/after migration
| Chain 1 | 33.3% -> 28.6% | 100/300 -> 100/350
| Chain 2 | 33.3% -> 28.6% | 100/300 -> 100/350
| Chain 3 | 33.3% -> 28.6% | 100/300 -> 100/350
| Chain 4 | 0% -> 14.3% (4.8% in each of 3 regions) | 0/300 -> 50/350 (spread across 3 regions)
| Total | 100% -> 100% | 300/300 -> 350/350
|=========

For the original three chains, the total amount of unit interval devoted to those chains is (100+100+100)/350 = 300/350. The 4th chain, because its weighting is only 50, would be assigned 50/350 of the unit interval. Then, an equal amount of unit interval is taken from the original chains and reassigned to chain #4, so (50/350)/3 of the unit interval must be taken from each original chain.
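
The arithmetic is simple enough to express directly; the sketch below (module and function names are hypothetical) converts a list of {ChainName, Weight} pairs into each chain's fraction of the unit interval:

[source,erlang]
----
-module(weight_sketch).
-export([shares/1]).

%% Illustrative only: each chain's share of the unit interval is its weight
%% divided by the sum of all weights.
shares(Weights) ->
    Total = lists:sum([W || {_Chain, W} <- Weights]),
    [{Chain, W / Total} || {Chain, W} <- Weights].
----

For the example above, shares([{chain1, 100}, {chain2, 100}, {chain3, 100}, {chain4, 50}]) yields approximately 0.286 for each of the first three chains and 0.143 for chain 4, matching the table.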

==== Hotspot migration

With the lowest level API, it is possible to assign “hot” keys to specific chains, to try to balance a handful of keys that are very frequently accessed from a large number of keys that are very infrequently accessed. The table below gives an example that builds upon xref:migration-3to4[]. We assume that our “hot” key is mapped onto the unit interval at position 0.5.

.Consistent hashing lookup table with three chains of equal weight and a fourth chain with an extremely small weight
[options="header"]
|=========
| Unit interval start | Unit interval end | Chain name
| 0.000000 | 0.333333... | Chain 1
| 0.333333... | 0.5 | Chain 2
| 0.5 | 0.500000000000001 | Chain 4
| 0.500000000000001 | 0.666666... | Chain 2
| 0.666666... | 1.0 | Chain 3
|=========

The table above looks almost exactly like the “Before Migration” half of xref:migration-3to4[]. However, there’s a very tiny “hole” that is punched in chain #2’s space that maps key hashes in the range of 0.5 to 0.500000000000001 to chain #4.

[[adding-removing-client-nodes]] === Adding/Removing Client Nodes

It is not strictly necessary to formally configure a list of all Hibari client nodes that may use a Hibari cluster. However, practically speaking, it is useful to do so.

To bootstrap itself to be able to use Hibari servers, a Hibari client must be able to:

  1. Communicate with other Erlang nodes in the cluster.
  2. Receive “global hash” information from the cluster’s Admin Server.

To solve both problems, the Admin Server maintains a list of Hibari client nodes. (Hibari server nodes do not need this mechanism.) For each client node, a monitor process on the Admin Server polls the node to see if the gdss or gdss_client application is running. If the client node is running, then problem #1 (connecting to other nodes in the cluster) is automatically solved by using net_adm:ping/1. Problem #2 is solved by the client monitor calling brick_admin:spam_gh_to_all_nodes/0.
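In effect, the client monitor performs the equivalent of the following two calls on behalf of each registered client node (a simplified sketch; the node name is only an example):

------------------------------------------------------------
%% Simplified sketch of what the client monitor accomplishes for a
%% registered client node (the node name is an example only).
pong = net_adm:ping('gdss_client1@boxA'),   %% problem #1: join the Erlang cluster
brick_admin:spam_gh_to_all_nodes().         %% problem #2: push "global hash" info
------------------------------------------------------------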

The Admin Server’s client monitor runs approximately once per second, so there may be a delay of up to a couple of seconds before a newly-started Hibari client node is connected to the rest of the cluster and has all of the table info required to start work.

When a client node goes down, an OTP alarm is raised until the client is up and running again.

Two methods can be used to view and change the client node monitor list:

  • Use the Admin Server’s HTTP service: follow the “Add/Delete a client node monitor” hyperlink at the bottom of the top-level page.
  • Use the Erlang CLI to use these functions:

** brick_admin:add_client_monitor/1
** brick_admin:delete_client_monitor/1
** brick_admin:get_client_monitor_list/0
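For example, assuming each function takes the client node's name as its argument, a monitor entry could be added and inspected from the Admin Server's Erlang CLI like this (the node name is hypothetical):

------------------------------------------------------------
%% Hypothetical client node name; run from the Admin Server's Erlang CLI.
brick_admin:add_client_monitor('gdss_client1@boxA').
brick_admin:get_client_monitor_list().
brick_admin:delete_client_monitor('gdss_client1@boxA').
------------------------------------------------------------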

The Partition Detector Application

For multi-node Hibari deployments, Hibari includes a network monitoring feature that watches for partitions within the cluster, and attempts to minimize the database consequences of such partitions. This Erlang/OTP application is called the Partition Detector.

You can configure the network monitoring feature in the central.conf file. See xref:central-conf-parameters[] for details.

IMPORTANT: Use of this feature is mandatory for a multi-node Hibari deployment to prevent data corruption in the event of a network partition. If you don't care about data loss, then as an ancient Roman might say, ``Caveat emptor.'' Or in English, ``Let the buyer beware.''

For the network monitoring feature to work properly, you must first set up two separate networks, Network A and Network B, that connect to each of your Hibari physical bricks. The networks must be set up as follows:

  • Network A and Network B must be physically separate networks, with different IP and broadcast addresses. See the diagram below for a two node cluster.
  • Network A must be the network used for all Hibari data communications.
  • Network A should have as few physical failure points as possible. For example, a single switch or load balancer is preferable to two switches cabled together.
  • The separate Network B will be used to compare node heartbeat patterns.

IMPORTANT: For the network partition monitor to work properly, the partition monitor configuration settings on all of your Hibari physical bricks must match as closely as possible. Each Hibari physical brick must have unique IP addresses on its two network interfaces (as required by all IP networks), but all configurations must use the same IP subnets for the 'A' and 'B' networks, and all configurations must use the same network 'A' tiebreaker.

[[a-and-b-network-diagram]] .Network 'A' and network 'B' diagram svgimage::images/a-and-b-diagram[align="center", scaledwidth="80%"]

=== Partition Detector Heartbeats

Through the partition monitoring application, Hibari nodes send heartbeat messages to one another at the configurable heartbeat_beacon_interval, and each node keeps track of heartbeat history from each of the other nodes in the cluster. The heartbeats are transmitted through both Network A and Network B. If node gdss1@machine1 detects that the incoming heartbeats from gdss1@machine2 are absent both on Network A and on Network B, then gdss1@machine2 might have a problem. If the incoming heartbeats from gdss1@machine2 fail on Network A but not on Network B, a partition on Network A might be the cause. If heartbeats fail on Network B but not Network A, then Network B might have a partition problem, but this is less serious because Hibari data communication does not take place on Network B.

Configurable timers on each Hibari node determine the interval at which the absence of incoming heartbeats from another node is considered a problem. If on node gdss1@machine1 no heartbeat has been received from gdss1@machine2 for the duration of the configurable heartbeat_warning_interval, then a warning message is written to the application log of node gdss1@machine1. This warning message can be triggered by missing heartbeats either on Network A or on Network B; the warning message will indicate which node has not been heard from, and over which network.

=== Partition Detector’s Tiebreaker

If on node gdss1@machine1 no heartbeat has been received from gdss1@machine2 via Network A for the duration of the configurable heartbeat_failure_interval, and if during that period heartbeats from gdss1@machine2 continue to be received via Network B, then a network partition is presumed to have occurred in Network A. In this scenario, node gdss1@machine1 will attempt to ping the configurable network_a_tiebreaker address. If gdss1@machine1 successfully pings the tiebreaker address, then gdss1@machine1 considers itself to be on the “correct” side of the Network A partition, and it continues running. If by contrast gdss1@machine1 cannot successfully ping the tiebreaker address, then gdss1@machine1 considers itself to be on the “wrong” side of the Network A partition and shuts itself down. Meanwhile, comparable calculations and decisions are being made by node gdss1@machine2.
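The decision described above amounts to the following sketch. The ping_tiebreaker/1 helper is a placeholder standing in for an ICMP ping of the network_a_tiebreaker address; it is not part of Hibari's API.

------------------------------------------------------------
%% Sketch of the Network A tiebreaker decision; the module and helper below
%% are hypothetical, not part of Hibari.
-module(tiebreaker_sketch).
-export([decide/1]).

decide(TiebreakerAddr) ->
    case ping_tiebreaker(TiebreakerAddr) of
        ok    -> keep_running;          %% "correct" side of the partition
        error -> shut_down_this_node    %% "wrong" side: node shuts itself down
    end.

%% Placeholder: a real implementation would ICMP-ping the tiebreaker address.
ping_tiebreaker(_Addr) ->
    ok.
------------------------------------------------------------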

In a scenario where the network monitoring application determines that a partition has occurred on Network B – that is, heartbeats are received through Network A but not through Network B – then warnings are written to the Hibari nodes’ application logs but no node is shut down.

Backup and Disaster Recovery

=== Backup and Recovery Software

At the time of writing, Hibari’s largest cluster deployment is:

  • Well over 50 physical bricks
  • Well over 4TB of disk space per physical brick
  • Single data center, operated by a telecom carrier and integrated with third-party monitoring and control software

If a backup were made of all data in the cluster, the biggest question is, "Where would you store the backup?" Given the cluster's purpose (real-time email/messaging services), the quality of the data center's physical and software infrastructures, the length of the Hibari chains used for physical data redundancy, the business factors influencing the choice not to deploy a "hot backup" data center, and other factors, Cloudian has not developed backup and recovery software for Hibari. Cloudian's smaller Hibari deployments resemble the largest deployment in these respects.

However, we expect that backup and recovery software will be high priorities for open source Hibari users. Together with the open source users and developers, we expect this software to be developed relatively quickly.

=== Disaster Recovery via Remote Data Centers

==== Single Hibari cluster spanning two data centers

It is certainly possible to deploy a single Hibari cluster across two (or more) data centers. At the moment, however, there is only one way of doing it: each chain of data replication must have a brick located in each data center.

As a consequence of brick placement, it is mandatory that Hibari clients pay the full round-trip latency penalty for each update. See xref:diagram-write-path-3[] for a diagram; the “head” and “tail” bricks would be in separate data centers, using WAN network connectivity between them.

For some applications, strong consistency is a higher priority than low latency (both for writes and possibly for reads, if the client is not co-located in the same data center as the chain’s tail brick). In those cases, such cross-data-center brick placement can make sense.

However, Hibari's Admin Server cannot handle all failure scenarios, especially when WAN connectivity is broken between data centers; more programming work is required before the Admin Server can automate the handling of all such cases. Furthermore, Hibari's basic design does not tolerate network partitions well; see xref:cap-theorem-and-hibari[] and xref:admin-server-and-network-partition[]. Even if the Admin Server were capable of handling WAN network partitions, it's almost certain that all Hibari nodes in one of the partitioned data centers would be inactive.

==== Multiple Hibari clusters, one per data center

Conceptually, it’s possible to run multiple Hibari clusters, one per data center. However, Hibari does not have the software required for WAN-scale replication.

In theory, such software isn’t too difficult to develop. The tail brick of each chain can maintain a log of recent updates to the chain. Those updates can be transmitted asynchronously across a WAN to another Hibari cluster in a remote data center. Such a scheme is depicted in the figure below.

[[async-replication-try1]] .A future scenario of asynchronous, cross-data-center Hibari replication svgimage::images/async-replication-try1[align="center", scaledwidth="80%"]

This kind of replication makes the most sense if "Data Center #1" were in an active role and "Data Center #2" were in a hot-standby role. In that case, there would never be a "Data Center #2 Client", so there would be no problem of strong consistency violations by clients accessing both Hibari clusters simultaneously. The only consistency problem would be one of durability: the replay of async update logs every N seconds would mean that up to N seconds of updates within "Data Center #1" could be lost.

However, if clients access both Hibari clusters simultaneously, then Hibari's strong consistency guarantee would be violated. Some applications can tolerate weakened consistency. Other applications, however, cannot. For those apps that must have strong consistency, Hibari will require additional design and code.

TIP: A keen-eyed reader will notice that xref:async-replication-try1[] is not fully symmetric. If clients in “Data Center #2” make updates to the chain, then the same async update log maintenance and replay to “Data Center #1” would also be necessary.

Hibari Application Logging

NOTE: This chapter is outdated and will be rewritten by Hibari v0.6 release. Hibari now uses link:https://github.com/basho/lager#readme[Basho Lager] for logging and the default location of the log files is: <HIBARI_HOME>/logs/

The Hibari application log records application-related alerts, warnings, and informational messages, as well as trace messages for debugging. By default the application log is written to this file:

<HIBARI_HOME>/var/log/gdss-app.log

=== Format of the Hibari Application Log

Each log entry in the Hibari application log is composed of these fields in this order, with vertical bar delimitation:

<PID>|<<ERLANGPID>>|<DATETIME>|<MODULE>|<LEVEL>|<MESSAGECODE>|<MESSAGE>

This Hibari application log entry format is not configurable. Each of these application log entry fields is described in the table that follows. The ``Position’’ column indicates the position of the field within a log entry.

[options="header",cols="^,^m,<"]
|=========
| Position | Field | Description
| 1 | <PID> | System-assigned process identifier (PID) of the process that generated the log message.
| 2 | <ERLANGPID> | Erlang process identifier.
| 3 | <DATETIME> | Timestamp in format %Y%m%d%H%M%S, where %Y = four digit year; %m = two digit month; %d = two digit date; %H = two digit hour; %M = two digit minute; and %S = two digit seconds. For example, 20081103230123.
| 4 | <MODULE> | The internal component with which the message is associated. This field is set to a minimum length of 13 characters. If the module name is shorter than 13 characters, spaces will be appended to the module name so that the field reaches the 13 character minimum.
| 5 | <LEVEL> | The severity level of the message. The level will be one of the following: ALERT, a condition requiring immediate correction; WARNG, a warning message, indicating a potential problem; INFO, an informational message indicating normal activity, and requiring no action; DEBUG, a highly granular, process-descriptive message potentially of use when debugging the application.
| 6 | <MESSAGECODE> | Integer code assigned to all messages of severity level INFO or higher. NOTE: This code is not yet defined in the Hibari open source release.
| 7 | <MESSAGE> | The message itself, describing the event that has occurred.
|=========

=== Application Log Example

Items written to the Hibari application log come from multiple sources:

  • The Hibari OTP application
  • Other OTP applications bundled with Hibari
  • Other OTP applications within the Erlang runtime system, e.g. kernel and sasl.

The <MESSAGE> field is free-form text. Application code can freely add newline characters and various white-space padding wherever it wishes. However, the file format dictates that a newline character (ASCII 10) appear only at the end of the entire app log message.

The Hibari error logger must therefore reformat the text of the <MESSAGE> field to remove newlines and to remove whitespace padding. The result is not nearly as readable as the formatting presented to the Erlang shell. For example, within the shell, a message can look like this:

=PROGRESS REPORT==== 12-Apr-2010::17:49:22 ===
          supervisor: {local,sasl_safe_sup}
             started: [{pid,<0.43.0>},
                       {name,alarm_handler},
                       {mfa,{alarm_handler,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,2000},
                       {child_type,worker}]

Within the Hibari application log, however, the same message is reformatted as line #2 below. The reformatted version is much more difficult for a human to read than the version above, but the purpose of the app log file is to be machine-parsable, not human-parsable.

8955|<0.54.0>|20100412174922|gmt_app |INFO|2190301|start: normal []
8955|<0.55.0>|20100412174922|SASL |INFO|2199999|progress: [{supervisor,{local,gmt_sup}},{started,[{pid,<0.56.0>},{name,gmt_config_svr},{mfa,{gmt_config_svr,start_link,["../priv/central.conf"]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]
8955|<0.55.0>|20100412174922|SASL |INFO|2199999|progress: [{supervisor,{local,gmt_sup}},{started,[{pid,<0.57.0>},{name,gmt_tlog_svr},{mfa,{gmt_tlog_svr,start_link,[]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]
8955|<0.36.0>|20100412174922|SASL |INFO|2199999|progress: [{supervisor,{local,kernel_safe_sup}},{started,[{pid,<0.59.0>},{name,timer_server},{mfa,{timer,start_link,[]}},{restart_type,permanent},{shutdown,1000},{child_type,worker}]}]
[...skipping ahead...]
8955|<0.7.0>|20100412174923|SASL |INFO|2199999|progress: [{application,gdss},{started_at,gdss_dev2@bb3}]
8955|<0.98.0>|20100412174923|DEFAULT |INFO|2199999|brick_sb: Admin Server not registered yet, retrying
8955|<0.65.0>|20100412174923|SASL |INFO|2199999|progress: [{supervisor,{local,brick_admin_sup}},{started,[{pid,<0.98.0>},{name,brick_sb},{mfa,{brick_sb,start_link,[]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]
8955|<0.105.0>|20100412174924|DEFAULT |INFO|2199999|top of init: bootstrap_copy1, [{implementation_module,brick_ets},{default_data_dir,"."}]
8955|<0.105.0>|20100412174924|DEFAULT |INFO|2199999|do_init_second_half: bootstrap_copy1
8955|<0.79.0>|20100412174924|SASL |INFO|2199999|progress: [{supervisor,{local,brick_brick_sup}},{started,[{pid,<0.105.0>},{name,bootstrap_copy1},{mfa,{brick_server,start_link,[bootstrap_copy1,[{default_data_dir,"."}]]}},{restart_type,temporary},{shutdown,2000},{child_type,worker}]}]
8955|<0.105.0>|20100412174924|DEFAULT |INFO|2199999|do_init_second_half: bootstrap_copy1 finished
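Because the format is strictly delimited, splitting an entry back into its seven fields is straightforward. A minimal sketch follows; the module and function names are hypothetical, not part of Hibari:

------------------------------------------------------------
-module(gdss_log_parse).   %% hypothetical module name
-export([parse_log_entry/1]).

%% Split one application log entry into its seven fields. Only the first six
%% "|" delimiters are significant; everything after them belongs to the
%% free-form <MESSAGE> field.
parse_log_entry(Line) when is_binary(Line) ->
    split_fields(Line, 6, []).

split_fields(Rest, 0, Acc) ->
    lists:reverse([Rest | Acc]);
split_fields(Bin, N, Acc) ->
    [Field, Rest] = binary:split(Bin, <<"|">>),
    split_fields(Rest, N - 1, [Field | Acc]).
------------------------------------------------------------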

== Examining Latency in Production (Internal Event Tracing)

The Hibari source code has been annotated with over 400 tracepoints, which give the developer and system administrator a way to trace events through Hibari's code. Those tracepoints are designed to be extremely lightweight and can be enabled in a production environment without sacrificing performance.

Trace data can be collected via DTrace/SystemTap or Erlang's tracing mechanism. For more details, please refer to the link:http://hibari.github.com/hibari-doc/hibari-contributor-guide.en.html#_hibari_internal_tracepoints["Hibari internal tracepoints"] section of the Hibari Contributor's Guide.

Hardware and Software Considerations

As noted in xref:hibari-origins[], at the time of writing, Hibari has been deployed exclusively in data centers run by telecom carriers. Each carrier has very specific requirements for integrating with its existing deployment, network monitoring, alarm management, and other infrastructures. As a result, many of those features have so far been omitted from Hibari. With Hibari's release into an "open source environment", we expect that these gaps will be closed.

Hibari’s carrier-centric heritage has also influenced the types of hardware, networking gear, operating system, support software, and internal Hibari configuration that have been used successfully to date. Some of these practices will change as Hibari evolves from its original use patterns. Until then, this section discusses some of the things that a systems/network administrator must consider when deploying a Hibari cluster.

Similarly, application developers must be very familiar with these same issues. An unaware developer can create an application that uses too many resources on under-specified hardware, causing problems for developers, support staff, and application users alike. We wish Hibari to grow and flourish in its non-relational DB niche.

[[brick-hardware]] === Notes on Brick Hardware

==== Lots of RAM is better

Each Hibari logical brick stores all information about its keys in RAM. Neither the logical brick's private write-ahead log nor the common write-ahead log is a ``disk-based data structure'' in the typical sense, such as an on-disk hash table or B-tree. Therefore, Hibari bricks require a lot of RAM to function.

For more details, see:

  • xref:overview-high-performance[]
  • xref:per-table-config-perf-options[] ... if a table stores its value blobs in RAM, it will consume more RAM than if those value blobs are stored on disk.
  • xref:hibari-data-model[]
  • xref:brick-init[]

==== Lots of disk I/O capacity is better

By default, Hibari will write and flush each update to disk before sending a reply downstream or back to the client. Hibari will perform better on systems that have higher disk I/O capacity.

  • Non-volatile/battery-backed cache on the disk controller(s) is helpful, when combined with a write-back cache policy. The more cache, the better. If the read/write ratio of the cache can be changed, a default policy of 10/90 or 0/100 (i.e. skewed to writes) is typically more helpful than a default 50/50 split.
  • On-disk (volatile) cache on individual disks is not helpful.
  • Faster spinning disks are more helpful than slower spinning disks.
  • If using RAID, a large stripe width of e.g. 512KBytes or 1024KBytes is usually more helpful than the (usually) smaller default stripe width on most controllers.
  • If using RAID, a hardware RAID implementation may be very slightly helpful.
  • RAID redundancy (e.g. RAID 1, 10, 5, 6) is not required by Hibari, but it can help reduce the odds of failure of an individual physical brick. If physical bricks do not use data redundant RAID (e.g. RAID 0, concatenation), it’s a good idea to consider using longer replication chains to compensate.

For more details, see:

  • xref:the-physical-brick[]
  • xref:per-table-config-perf-options[]
  • xref:hibari-data-model[]

[[high-io-rate-devices]] ==== High I/O rate devices (e.g. SSD) may be used

Hibari has some support for high I/O rate devices such as solid state disks, flash memory disks, flash memory storage cards, et al. There is nothing in Hibari’s implementation that would preclude using high-speed disk devices as the only storage for Hibari write-ahead logs.

Hibari has a feature that can segregate high write I/O with fsync(2) operations onto a separate high-speed device, and use cheaper & lower-speed Winchester disk devices for bulk storage. This feature has not yet been well-tested and optimized.

For more details, see:

  • xref:write-ahead-logs[]
  • xref:two-wal-types[]

==== Lots of disk storage capacity may be a secondary concern

More disks of smaller capacity are almost always more helpful than a few disks of larger capacity. RAID 0 (no data redundancy) or RAID 10 (“mirror” data redundancy) is useful for combining the I/O capacity of multiple disks into a single logical volume. Other RAID levels, such as 5 or 6, can be used, though at the expense of higher write I/O overhead.

For more details, see:

  • xref:write-ahead-logs[]

[[considerations-cpu]] ==== Lots of CPU capacity is a secondary concern

Hibari storage bricks do not, as a general rule, require large amounts of CPU capacity. The largest single source of CPU consumption is in MD5 checksum calculation. If the data objects most commonly written & read by your application are small, then multi-socket, multi-core CPUs are not required.

Each Hibari logical brick is implemented within the Erlang virtual machine as a single gen_server process. Therefore, each logical brick can (generally speaking) only fully utilize one CPU core. If your Hibari cluster appears to have CPU-utilization imbalance, then the recommended strategy is to change the chain placement policy of the chains. For example, there are two methods for arranging a chain of length three across three physical bricks:

[[1-chain-striped-across-3-bricks]] The first example shows one chain striped across three physical bricks. If the read/write ratio for the chain is extremely high (i.e. most operations are reads), then most of the CPU activity (and perhaps disk I/O, if blobs are stored on disk) will be directed to the “Chain 1 tail” brick and cause a CPU utilization imbalance.

.One chain striped across three physical bricks
[options="header"]
|=========
| Physical Brick X | Physical Brick Y | Physical Brick Z
| Chain 1 head -> | Chain 1 middle -> | Chain 1 tail
|=========

[[3-chains-striped-across-3-bricks]] The second example shows the same three physical bricks but with three chains striped across them. In this example, each physical brick is responsible for three different roles: head, middle, and tail. Regardless of the read/write operation ratio, all bricks will utilize roughly the same amount of CPU.

.Three chains striped across three physical bricks
[options="header"]
|=========
| Physical Brick T | Physical Brick U | Physical Brick V
| Chain 1 head -> | Chain 1 middle -> | Chain 1 tail
| Chain 2 tail | Chain 2 head -> | Chain 2 middle ->
| Chain 3 middle -> | Chain 3 tail | Chain 3 head ->
|=========

In multi-CPU and multi-core systems, a side-effect of using more chains (and therefore more bricks) is that the Erlang virtual machine can schedule more logical brick computation across a larger number of cores and CPUs.
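For illustration only, the second layout could be expressed as the kind of chain definition list passed to brick_admin:add_table(); the chain, brick, and node names below are hypothetical:

------------------------------------------------------------
%% Hypothetical chain definition: three chains of length three, striped so
%% that each node hosts one head, one middle, and one tail role.
ChainList = [{tab1_ch1, [{tab1_ch1_b1, hibari1@boxT},     %% head on T
                         {tab1_ch1_b2, hibari1@boxU},     %% middle on U
                         {tab1_ch1_b3, hibari1@boxV}]},   %% tail on V
             {tab1_ch2, [{tab1_ch2_b1, hibari1@boxU},     %% head on U
                         {tab1_ch2_b2, hibari1@boxV},     %% middle on V
                         {tab1_ch2_b3, hibari1@boxT}]},   %% tail on T
             {tab1_ch3, [{tab1_ch3_b1, hibari1@boxV},     %% head on V
                         {tab1_ch3_b2, hibari1@boxT},     %% middle on T
                         {tab1_ch3_b3, hibari1@boxU}]}].  %% tail on U
------------------------------------------------------------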

=== Notes on Networking

Hibari works quite well using commodity “Gigabit Ethernet” interfaces. Lower latency (and higher cost) networking gear, such as Infiniband, is not required.

For production use, it is _strongly recommended_ that all Hibari servers be configured with two physical network interfaces, cabling, switches, etc. For more details, see:

  • xref:partition-detector[]

==== Client protocol load balancing

The native Erlang client, via the gdss or gdss_client OTP applications, does not require any load balancing. The Erlang client is already a participant in the consistent hashing algorithm (see xref:consistent-hashing-example[]). The Admin Server distributes updates to a table's consistent hash map each time cluster membership or chain/brick status changes.

All other client access protocols are “dumb”, by comparison. Take for example the Amazon S3 protocol service. There is no easy way for a Hibari cluster to convey to a generic HTTP client how to calculate which brick to send a query to. The HTTP redirect mechanism could be used for this purpose, but other protocols don’t have an equivalent feature. Also, the latency overhead of sending a redirect is far higher than Hibari’s solution to this problem.

Hibari's solution is simple: the Hibari server-side "dumb" protocol handler uses the same native Erlang client that any other Hibari client app written in Erlang uses. That client is capable of making direct routing decisions. Therefore, the "dumb" protocol handler within a Hibari node acts as a translating proxy: it speaks the "dumb" client access protocol on one side and uses the native Erlang client API on the other.

.Hibari "dumb" protocol proxy svgimage::images/dumb-protocol-proxy[align="center", scaledwidth="80%"]

The deployed “state of the art” for such dumb protocols is to use a TCP load balancer (aka a “layer 4” load balancer) to spread dumb client workload across multiple Hibari dumb protocol servers.

=== Notes on Operating System

Hibari servers operate on top of the Erlang virtual machine. In principle, any operating system that is supported by the Erlang virtual machine can support Hibari.

==== Supported Operating Systems

In practice, Hibari is supported on the following operating systems:

  • Linux x86_64
** Red Hat Enterprise Linux 5.x and 6.x (RHEL 5.3 is used in production and QA environments within Cloudian, Inc.)
** CentOS 5.x and 6.x
** Ubuntu 12.04 LTS or newer

  • Linux ARMv7 (32 bit)
** Ubuntu 12.04 LTS or newer
** Hibari runs on Calxeda EnergyCore based super high-density, scale-out clusters

  • Unix Solaris variants
** Joyent SmartOS (64 bit)

  • Mac OS X

  • FreeBSD (though not currently in a jail environment, due to some TCP services getting EPROTONOSUPPORT errors)

The versions recently tested for Hibari by the community:

  • CentOS 6.3 (x86_64)
  • Ubuntu 12.04 LTS (ARMv7)
  • Joyent SmartOS 20130221 (64 bit)

To take advantage of RAM larger than 4GB, we recommend that you use a 64-bit version of your OS’s kernel, 64-bit versions of the user runtime, and a 64-bit version of the Erlang/OTP runtime.

[[os-readahead-configuration]] ==== OS Readahead Configuration

Some operating systems have support for OS-based “readahead”: pre-fetching blocks of a file with the expectation that those blocks will soon be requested by the application. Properly configured, readahead can substantially raise throughput and reduce latency on many read-heavy I/O workloads.

The read I/O workloads for Hibari fall into two major categories:

  1. Extremely predictable sequential read-only I/O during brick initialization (see xref:brick-init[]).
  2. Extremely unpredictable random read I/O for fetching value blobs from disk.

The first I/O pattern can usually benefit a great deal from an aggressive readahead policy. However, an aggressive readahead policy can have the opposite effect on the second I/O pattern. Readahead policies under Linux, for example, are defined on a per-block-device basis and do not change in response to application runtime behavior.

If your OS supports readahead policy configuration, we recommend starting with a small readahead value and then measuring its effect with a real or simulated workload against a real Hibari server.

[[disk-scheduler-configuration]] ==== Disk Scheduler Configuration

We recommend that you experiment with disk scheduler configuration on relevant OSes such as Linux. The "deadline" scheduler is likely to provide better performance characteristics than the default scheduler for Hibari's workload.

=== Notes on Supporting Software

A typical “server” type installation of a Linux or FreeBSD OS is sufficient for Hibari. The following is an incomplete list of other software packages that are necessary for Hibari’s installation and/or runtime.

  • NTP
  • Erlang/OTP version R13B04
  • Either “lynx” or “elinks”, a text-based Web browser

// JWN: This seems like a good place to mention patches that are // needed beyond R13B04 ... busy dist port?

[[ntp-config-strongly-recommended]] ==== NTP configuration of all Hibari server and client nodes

It is strongly recommended that all Hibari server and client nodes have the NTP daemon (Network Time Protocol) installed, properly configured, and running.

  • The brick_simple client API uses the OS clock for automatic generation of timestamps for each key update. The application problems caused by badly out-of-sync OS clocks can be easily avoided by NTP.
  • If a client's clock is skewed by more than the brick_do_op_too_old_timeout configuration attribute in central.conf (units = milliseconds), then the brick will silently discard the client's operation. The only symptoms of this are:
** Client-side timeouts when using the brick_simple, brick_server, or brick_squorum APIs.
** Increasing n_too_old statistic counter on the brick.

=== Notes on Hibari Configuration

There are several reasons why disk I/O rates can temporarily increase within a Hibari physical brick:

  • Logical brick checkpoints for increased write I/O ops, see xref:checkpoints[]
  • The common log “scavenger” for increased read and write I/O ops, see xref:scavenger[]
  • Chain replication repair, see xref:chain-repair[]
** As the upstream/"repairer" brick, the extra read I/O ops, if the brick stores value blobs on disk
** As the downstream/"repairee" brick, extra write I/O ops

The Hibari central.conf file contains parameters that can limit the amount of disk bandwidth used by most of these operations.

See also:

  • xref:considerations-cpu[]
  • xref:central-conf-parameters[]

=== Notes on Monitoring a Hibari Cluster

The Admin Server’s status page contains current status information regarding all tables, chains, and bricks in the cluster. By default, this service listens to TCP port 23080 and is reachable via HTTP at http://any-hibari-node-name:23080/. HTTP redirect will steer your browser to the Admin Server node.

  • Hypertext links for each table, chain, and brick can show more detailed info on each entity.
  • The “Dump History” link at the bottom of the Admin Server’s HTTP status page can show operations history across multiple bricks, chains, and/or tables by using the regular expression feature.
  • Each logical brick maintains counters of each type of Hibari client op primitive. At present, these stats are only exposed via the HTTP status server or by the native Erlang interface, but it’s possible to expose these stats via SNMP and other protocols in a straightforward manner.
** Stats include: number of add, replace, set, get, get_many, delete, and micro-transactions.

==== Hibari Admin Server HTTP status

For example screen shots of the Admin Server status pages (a work in progress), see link:./misc-screenshots/admin-server-status/index.html[].

See also:

  • xref:chain-lifecycle-fsm[]
  • xref:brick-lifecycle-fsm[]
Administering Hibari Through the API
  • Add a new table
  • Delete a table
  • Change to a single chain:

** Add one or more bricks (increase replication factor)
** Remove one or more bricks (decrease replication factor)
  • Change to a single table:
** Add a new chain
** Remove a chain
** Change the chain weighting factor
** Change consistent hashing parameters

[[add-a-new-table]] === Add a New Table: brick_admin:add_table()

[[why-use-hash-prefixes]] ==== Why use hash prefixes?

Hash prefixes allow Hibari servers to guarantee the application developer that certain keys will always be stored on the same chain and therefore always on the same set of bricks. With this guarantee, an application aware of hash prefixes can use micro-transactions successfully.

For example, assume the application requires a collection of persistent stacks that are stored in Hibari.

  • Each stack is identified by a string/binary. (The two types are identical for the sake of discussion.)
  • Each item stored on the stack is a string.
  • Support stack options push & pop.
  • Support quick stack stats, e.g. # of elements on the stack and # of bytes stored on the stack.
  • Stacks may contain hundreds of thousands of items.
  • The total size of a stack will not exceed the total storage capacity of any single brick in the cluster.

IMPORTANT: Understanding the last assumption is vital. Because all keys with the same hash prefix H will be managed by the same chain C, then all bricks in C must have enough capacity to store all H prefix keys.

The application developer then makes the following decisions:

  1. The application will use a table devoted to storing stacks, called ‘stack’.
  2. We know that the application requires strong durability (which is the Hibari default) and that the sum total of all stack items will exceed a single brick’s RAM capacity. Therefore, the ‘stack’ table must store its value blobs on disk. Read access to the table will be slower than if value blobs were stored in RAM, but the limited RAM capacity of bricks does not give us a choice.
  3. We have two machines, boxA and boxB, available for hosting the table’s logical bricks. We want to be able to survive at least one physical brick failure, therefore all chains have a minimum length of 2.
** We will use two chains, so that each physical machine (when up and running smoothly) will have 2 logical bricks for the table, one in the chain head role and one in the chain tail role.
** The naming scheme used for each chain name and brick name can be arbitrary, as long as all names are unique. However, for ease-of-management purposes, the use of a systematic naming scheme is strongly encouraged. The scheme used here numbers each chain (starting at 1) and numbers each brick (also starting at 1) with both the chain and brick number.

  4. We use the following key naming convention:
** A stack's metadata (item count, byte count) uses <<"/StackName/md">>.
** An item uses <<"/StackName/N">> where N is the item number.
  5. We create the table using the following:
+
------------------------------------------------------------
Opts = [{hash_init, fun brick_hash:chash_init/3},
        {prefix_method, var_prefix},
        {num_separators, 2},
        {prefix_separator, $/},
        {new_chainweights, [{stack_ch1, 100}, {stack_ch2, 100}]},
        {bigdata_dir, "."},
        {do_logging, true},
        {do_sync, true}].
ChainList = [{stack_ch1, [{stack_ch1_b1, hibari1@boxA},
                          {stack_ch1_b2, hibari1@boxB}]},
             {stack_ch2, [{stack_ch2_b1, hibari1@boxB},
                          {stack_ch2_b2, hibari1@boxA}]}].
brick_admin:add_table(stack, ChainList, Opts).
------------------------------------------------------------

See xref:examples-using-the-stack[] for sample usage code.

[[types-of-brick-admin-add-table]] ==== Types for brick_admin:add_table()

add_table(Name, ChainList, BrickOptions)
  when is_atom(Name), is_list(ChainList)
  equivalent to add_table(brick_admin, Name, ChainList, BrickOptions)

add_table(ServerRef, Name, BrickOptions)
  when is_atom(Name), is_list(BrickOptions)
  equivalent to add_table(ServerRef, Name, ChainList, [])

add_table(ServerRef::gen_server_serverref(), Name::table(),
          ChainList::chain_list(), BrickOptions::brick_options())
  -> ok | {error, term()} | {error, term(), term()}

gen_server_serverref() = "ServerRef" type from STDLIB gen_server, gen_fsm, etc.
proplists_property() = "Property" type from STDLIB proplists

bigdata_option() = {'bigdata_dir', string()}
brick() = {logical_brick(), node()}
brick_option() = chash_prop() | custom_prop() | fixed_prefix_prop() |
                 {'hash_init', fun/3} | var_prefix_prop()
brick_options() = [brick_option()]
chain_list() = [{chain_name(), [brick()]}]
chain_name() = atom()
chash_prop() = {'new_chainweights', chain_weights()} |
               {'num_separators', integer()} |
               {'old_float_map', float_map()} |
               {'prefix_is_integer_hack', boolean()} |
               {'prefix_length', integer()} |
               {'prefix_method', 'all' | 'var_prefix' | 'fixed_prefix'} |
               {'prefix_separator', integer()}
chain_weights() = [{chain_name(), integer()}]
custom_prop() = proplists_property()
fixed_prefix_prop() = {'prefix_is_integer_hack', boolean()} |
                      {'prefix_length', integer()}
logging_option() = {'do_logging', boolean()}
logical_brick() = atom()
node() = atom()
sync_option() = {'do_sync', boolean()}
table() = atom()
var_prefix_prop() = {'num_separators', integer()} |
                    {'prefix_separator', integer()}

{'bigdata_dir', string()}::
To store value blobs on disk (i.e. "big data" is true), specify this value with any string (the string's actual value is not used).
+
IMPORTANT: To store value blobs in RAM, this option must be omitted.

{'do_logging', boolean()}::
Specify whether all bricks in the table will log updates to disk. If not specified, the default is true.

{'do_sync', boolean()}::
Specify whether all bricks in the table will synchronously flush all updates to disk before responding to the client. If not specified, the default is true.

{'hash_init', fun/3}::
Specify the hash initialization function. Of the four hash methods bundled with Hibari, we recommend using brick_hash:chash_init/3 only.

{'new_chainweights', chain_weights()}::
(For brick_hash:chash_init/3) Specify the chain weights for this new table. For creating a new table, this option is not used. However, this option is used when changing a table to add/remove chains or to change other table-related parameters.

{'num_separators', integer()}::
(For brick_hash:chash_init/3 and brick_hash:var_prefix_init/3) For variable prefix hashes, this option specifies how many instances of the variable prefix separator character (see 'prefix_separator' below) are included in the hashing prefix. The default is 2.
+
For example, if {'prefix_separator', $/}, then
+
** With {'num_separators', 2} and key <<"/foo/bar/baz/hello">>, the hashing prefix is <<"/foo/">>.
** With {'num_separators', 3} and key <<"/foo/bar/baz/hello">>, the hashing prefix is <<"/foo/bar/">>.

{'old_float_map', float_map()}::
Specify the old version of the "float map". For creating a new table, this option is not used. However, this option is used when changing a table to add/remove chains or to change other table-related parameters: it is used to create a new mapping of {table, key} -> chain that relocates only a minimum number of keys to a new chain.

{'prefix_method', 'all' | 'var_prefix' | 'fixed_prefix'}::
(For brick_hash:chash_init/3) Specify which prefix method will be used for consistent hashing:
+
** 'all': Use the entire key
** 'var_prefix': Use a variable-length prefix of the key
** 'fixed_prefix': Use a fixed-length prefix of the key

{'prefix_is_integer_hack', boolean()}::
(For brick_hash:fixed_prefix_init/3) If true, the prefix should be interpreted as an ASCII representation of a base 10 integer for use as the hash calculation.

{'prefix_length', integer()}::
(For brick_hash:fixed_prefix_init/3) For fixed-prefix hashes, this option specifies the prefix length.

{'prefix_separator', integer()}::
(For brick_hash:chash_init/3 and brick_hash:var_prefix_init/3) For variable prefix hashes, this option specifies the single-byte ASCII value of the byte that separates the key's prefix from the rest of the key. The default is $/, ASCII 47.
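Putting these options together, a minimal table whose value blobs stay in RAM (note the omitted bigdata_dir option) and which otherwise accepts the defaults might be created as in the sketch below; the table, chain, brick, and node names are examples only.

------------------------------------------------------------
%% Sketch: one chain, one brick, value blobs kept in RAM (no bigdata_dir),
%% with default logging and sync behavior.
Opts = [{hash_init, fun brick_hash:chash_init/3},
        {prefix_method, var_prefix},
        {new_chainweights, [{tab1_ch1, 100}]}].
ChainList = [{tab1_ch1, [{tab1_ch1_b1, hibari1@boxA}]}].
brick_admin:add_table(tab1, ChainList, Opts).
------------------------------------------------------------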

[[examples-using-the-stack]] ==== Example code for using the stack

.Create a new stack

------------------------------------------------------------
%% Assumes a record definition such as: -record(stack_md, {count, bytes}).
Val = #stack_md{count = 0, bytes = 0}.
brick_simple:add(stack, "/new-stack/md", term_to_binary(Val)).
------------------------------------------------------------

.Push an item onto a stack

------------------------------------------------------------
%% Assumes NewItem is bound to the binary item being pushed.
{ok, OldTS, OldVal} = brick_simple:get(stack, "/new-stack/md").
#stack_md{count = Count, bytes = Bytes} = binary_to_term(OldVal).
NewMD = #stack_md{count = Count + 1, bytes = Bytes + size(NewItem)}.
ItemKey = "/new-stack/" ++ integer_to_list(Count).
[ok, ok] = brick_simple:do(stack,
                           [brick_server:make_txn(),
                            brick_server:make_replace("/new-stack/md",
                                                      term_to_binary(NewMD),
                                                      0, [{testset, OldTS}]),
                            brick_server:make_add(ItemKey, NewItem)]).
------------------------------------------------------------


.Pop an item off a stack

------------------------------------------------------------
{ok, OldTS, OldVal} = brick_simple:get(stack, "/new-stack/md").
#stack_md{count = Count, bytes = Bytes} = binary_to_term(OldVal).
ItemKey = "/new-stack/" ++ integer_to_list(Count - 1).
{ok, _, Item} = brick_simple:get(stack, ItemKey).
NewMD = #stack_md{count = Count - 1, bytes = Bytes - size(Item)}.
[ok, ok] = brick_simple:do(stack,
                           [brick_server:make_txn(),
                            brick_server:make_replace("/new-stack/md",
                                                      term_to_binary(NewMD),
                                                      0, [{testset, OldTS}]),
                            brick_server:make_delete(ItemKey)]).
Item.
------------------------------------------------------------

[[delete-a-table]] === Delete a Table

As yet, Hibari does not have a method to delete a table. The only methods available now are:

  • Delete all files and subdirectories from the bootstrap_* brick data directories, restart the Admin Server, and recreate all tables. (Also known as, “Start over”.)
  • Make a backup copy of all bootstrap_* brick data directories before creating a new table. If you wish to undo, then stop Hibari on all Admin Server-eligible nodes, remove the bootstrap_* brick data directories, restore the bootstrap_* brick data directories from the previous backup, then start all of the Admin Server-eligible nodes.

[[change-a-chain-add-remove-bricks]] === Change a Chain: Add or Remove Bricks

Adding or removing bricks from a single chain changes the replication factor for the keys stored in that chain: more bricks increases the replication factor, and fewer bricks decreases it.

.Data types for brick_admin:change_chain_length()

------------------------------------------------------------
brick_admin:change_chain_length(ChainName, BrickList)

ChainName = atom()
BrickList = [brick()]

brick() = {logical_brick(), node()}
logical_brick() = atom()
node() = atom()
------------------------------------------------------------

See also, xref:example-change-chain-length[brick_admin:change_chain_length() usage examples].

[[change-a-table-add-remove-chains]] === Change a Table: Add/Remove Chains

.Data types for brick_admin:start_migration()
------------------------------------------------------------
brick_admin:start_migration(TableName, LH)
  equivalent to brick_admin:start_migration(TableName, LH, [])

brick_admin:start_migration(TableName, LH, Options)
  -> {ok, cookie()} | {'EXIT', term()}

TableName = atom()
LH = hash_r()
Options = migration_options()

cookie() = term()
migration_option() = {'do_not_initiate_serial_ack', boolean()} |
                     {'interval', integer()} |
                     {'max_keys_per_chain', integer()} |
                     {'max_keys_per_iter', integer()} |
                     {'propagation_delay', integer()}
migration_options() = [migration_option()]

brick_hash:chash_init('via_proplist', ChainList, Options) -> hash_r()

ChainList = chain_list()
Options = brick_options()
------------------------------------------------------------

See xref:types-of-brick-admin-add-table[] for definitions of chain_list() and brick_options() types.

The hash_r() type is an Erlang record, #hash_r as defined in the brick_hash.hrl header file. It is normally considered an opaque type that is created by a function such as brick_hash:chash_init/3.

NOTE: The options list passed in argument #3 to brick_hash:chash_init/3 is the same properties list that is used for brick_admin:add_table/3. The difference is that the options that are related strictly to brick behavior, such as the do_logging and do_sync properties, are ignored by chash_init/3.

Once a hash_r() term is created and brick_admin:start_migration/2 is called successfully, the data migration will start immediately.

The cookie() type is an opaque term that uniquely identifies the data migration that was triggered for the TableName table. Another data migration may not be triggered until the current migration has finished successfully.

The migration_option() properties are described below:

{'do_not_initiate_serial_ack', boolean()}::
For internal use only, do not use.

{'interval', integer()}::
Interval (in milliseconds) to send kick_next_sweep messages. Default = 50.

{'max_keys_per_chain', integer()}::
Maximum number of keys to send to any particular chain. Not yet implemented.

{'max_keys_per_iter', integer()}::
Maximum number of keys to examine per sweep iteration. Default = 500 for bricks with value blobs in RAM, 25 for bricks with value blobs on disk.

{'propagation_delay', integer()}::
Number of milliseconds to delay for each brick's logging operation. Default = 0.
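For example, a migration could be throttled to reduce its I/O impact by lengthening the sweep interval and adding a small propagation delay. The values below are illustrative only; NewLH is a hash_r() built as described above.

------------------------------------------------------------
%% Illustrative only: start a throttled migration of table tab1.
{ok, Cookie} = brick_admin:start_migration(tab1, NewLH,
                                           [{interval, 100},
                                            {propagation_delay, 5}]).
------------------------------------------------------------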

See also xref:changing-chains-example[].

[[change-a-table-chain-chain-weighting]] === Change a Table: Change Chain Weighting

The functions used to change chain weighting are the same as those used for adding/removing chains; see xref:change-a-table-add-remove-chains[] for additional details.

When creating a hash_r() type record, follow these two bits of advice:

  • The chain_list() term remains exactly the same as the chain list currently used by the table. See brick_admin:get_table_chain_list/1 for how to retrieve this list.
  • The new_chainweights property in the brick_options() list specifies a different set of chain weighting factors than is currently used by the table. The current chain weighting list is in the brick_options property returned by the brick_admin:get_table_info/1 function.
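Combining those two points, a re-weighting might be sketched as follows. The new weights are examples only, and OldFloatMap is extracted from the table's current ghash as shown in xref:changing-chains-example[].

------------------------------------------------------------
%% Sketch: keep the existing chains but change only their weighting factors.
{ok, ChainList} = brick_admin:get_table_chain_list(tab1).
NewOpts = [{prefix_method, var_prefix},
           {new_chainweights, [{tab1_ch1, 100}, {tab1_ch2, 200}]},  %% example weights
           {old_float_map, OldFloatMap}].
NewLH = brick_hash:chash_init(via_proplist, ChainList, NewOpts).
{ok, _Cookie} = brick_admin:start_migration(tab1, NewLH).
------------------------------------------------------------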

See also xref:changing-chains-example[].

[[admin-server-api]] === Admin Server API

See EDoc documentation for brick_admin.erl API.

[[scoreboard-api]] === Scoreboard API

See EDoc documentation for brick_sb.erl API.

[[chain-monitor-api]] === Chain Monitor API

See EDoc documentation for brick_chainmon.erl API.

[[changing-chain-length]] === Changing Chain Length: Examples

The Admin Server's basic definition of a chain consists of the chain's name and its list of bricks. In turn, each brick is defined by a 2-tuple of brick name and node name.

Example chain definition, chain length=1
{tab1_ch1, [{tab1_ch1_b1, hibari1@bb3}]}

The function brick_admin:get_table_chain_list/1 will retrieve the active chain definition list for a table. For example, we retrieve the chain definition list for the table tab1. The node bb3 is the hostname of my laptop.

(hibari1@bb3)24> Tab1ChList.
[{tab1_ch1,[{tab1_ch1_b1,hibari1@bb3}]}]

NOTE: The brick_admin:get_table_chain_list/1 function will retrieve the active chain definition list for a table: only bricks that are in ok state will be shown. If a chain has a brick that has crashed, that brick will not appear in the list returned by this function. The brick_admin:get_table_info() function can fetch the list of all bricks, in service and crashed, but the API is not as convenient.

[[example-change-chain-length]] To change the chain length, use the brick_admin:change_chain_length/2 function. The arguments are the chain name and brick list.

NOTE: Any bricks in the brick list that aren’t in the chain are automatically started. Any bricks in the current chain that are not in the new list are halted, and their persistent data will be deleted.

// JWN: The deletion is not immediate on disk - correct? Scavenger is // needed - right?

(hibari1@bb3)29> brick_admin:change_chain_length(tab1_ch1,
                     [{tab1_ch1_b1, hibari1@bb3}, {tab1_ch1_b2, hibari1@bb3}]).
ok

(hibari1@bb3)30> {ok, Tab1ChList2} = brick_admin:get_table_chain_list(tab1).
{ok,[{tab1_ch1,[{tab1_ch1_b1,hibari1@bb3},
                {tab1_ch1_b2,hibari1@bb3}]}]}

Now the tab1_ch1 chain has length two. We’ll shorten it back down to length 1.

(hibari1@bb3)31> brick_admin:change_chain_length(tab1_ch1, [{tab1_ch1_b2, hibari1@bb3}]).
ok

(hibari1@bb3)32> {ok, Tab1ChList3} = brick_admin:get_table_chain_list(tab1).
{ok,[{tab1_ch1,[{tab1_ch1_b2,hibari1@bb3}]}]}

NOTE: A chain’s new brick list must contain at least one brick from the current chain’s definition. If the intersection of old brick list and new brick list is empty, the command will fail.


[[changing-chains-example]] === Creating and Rebalancing Chains: Examples

The procedure for creating new chains, deleting existing chains, reweighting existing chains, and rehashing uses the brick_admin:start_migration() function. The chain definitions are specified in the same way as when changing chain lengths; see xref:changing-chain-length[] for details.

The data structure required by brick_admin:start_migration/2 is more complex than the relatively simple brick list that brick_admin:change_chain_length/2 requires. This section will demonstrate the creation of this structure, the ``local hash'' record, step-by-step.

First, we create a new chain definition list. (Refer to xref:changing-chain-length[] if necessary.) For this example, we’ll assume that we’ll be modifying the tab1 table and that we’ll be adding two more chains. Each chain will be of length one. We’ll place each chain on the same node as everything else, hibari1@bb3 (i.e. my laptop).

(hibari1@bb3)49> NewCL = [{tab1_ch1, [{tab1_ch1_b1, hibari1@bb3}]},
                          {tab1_ch2, [{tab1_ch2_b1, hibari1@bb3}]},
                          {tab1_ch3, [{tab1_ch3_b1, hibari1@bb3}]}].
[{tab1_ch1,[{tab1_ch1_b1,hibari1@bb3}]},
 {tab1_ch2,[{tab1_ch2_b1,hibari1@bb3}]},
 {tab1_ch3,[{tab1_ch3_b1,hibari1@bb3}]}]

NOTE: Any bricks in the brick list that aren't already in a chain are automatically started. Any bricks in the current chains that are not in the new chain definition are halted, and their persistent data will be deleted.

Next, we retrieve the table’s current hashing configuration. The data is returned to us in the form of an Erlang property list. (See the Erlang/OTP documentation for the proplists module, located in the “Basic Applications” area under “stdlib”.) We then pick out several properties that we’ll need later; we use lists:keyfind/3 instead of a function in the proplists module because it will preserve the properties in 2-tuple form, which will save us some typing effort later.

...lots of stuff omitted...

(hibari1@bb3)53> Opts = proplists:get_value(brick_options, TabInfo).
[{hash_init,#Fun<brick_hash.chash_init.3>},
 {old_float_map,[]},
 {new_chainweights,[{tab1_ch1,100}]},
 {hash_init,#Fun<brick_hash.chash_init.3>},
 {prefix_method,var_prefix},
 {prefix_separator,47},
 {num_separators,3},
 {bigdata_dir,"cwd"},
 {do_logging,true},
 {do_sync,true},
 {created_date,{2010,4,17}},
 {created_time,{17,21,58}}]

(hibari1@bb3)58> PrefixMethod = lists:keyfind(prefix_method, 1, Opts). {prefix_method,var_prefix}

(hibari1@bb3)59> NumSep = lists:keyfind(num_separators, 1, Opts). {num_separators,3}

(hibari1@bb3)60> PrefixSep = lists:keyfind(prefix_separator, 1, Opts). {prefix_separator,47}

(hibari1@bb3)61> OldCWs = proplists:get_value(new_chainweights, Opts). [{tab1_ch1,100}]

(hibari1@bb3)62> OldGH = proplists:get_value(ghash, TabInfo).

(hibari1@bb3)63> OldFloatMap = brick_hash:chash_extract_new_float_map(OldGH).

Next, we create a new property list.

(hibari1@bb3)72> NewOpts = [PrefixMethod, NumSep, PrefixSep,
                            {new_chainweights, NewCWs},
                            {old_float_map, OldFloatMap}].
[{prefix_method,var_prefix},
 {num_separators,3},
 {prefix_separator,47},
 {new_chainweights,[{tab1_ch1,100},
                    {tab1_ch2,100},
                    {tab1_ch3,100}]},
 {old_float_map,[]}]


Next, we use the chain definition list, NewCL, and the table options list, NewOpts, to create a ``local hash’’ record. This record will contain all of the configuration information required to change a table’s consistent hashing characteristics.

...lots of stuff omitted...

[[chash-migration-pre-check]] We’re just one step away from changing the tab1 table. Before we change the table, however, we’d like to see how the table change will affect the data in the table. First, we add 1,000 keys to the tab1 table. Then we use the brick_simple:chash_migration_pre_check/2 function to tell us how many keys will move and to where.

[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]

(hibari1@bb3)75> brick_simple:chash_migration_pre_check(tab1, NewLH).
[{keys_before,[{tab1_ch1,1001}]},
 {keys_keep,[{tab1_ch1,348}]},
 {keys_moving,[{tab1_ch2,315},{tab1_ch3,338}]},
 {keys_moving_where,[{tab1_ch1,[{tab1_ch2,315},
                                {tab1_ch3,338}]}]},
 {errors,[]}]


The output above shows us that of the 1,001 keys in the tab1 table, 348 will remain in the tab1_ch1 chain, 315 keys will move to the tab1_ch2 chain, and 338 keys will move to the tab1_ch3 chain. That looks like what we want, so let’s reconfigure the table and start the data migration.

brick_admin:start_migration(tab1, NewLH).

Immediately, we’ll see a bunch of application messages sent to the console as new activities start:

  • A migration monitoring process is started.
  • New brick processes are started.
  • New monitoring processes are started.
  • Data migrations start and finish.
  • The migration monitoring process exits.

=GMT INFO REPORT==== 20-Apr-2010::00:26:41 ===
progress: [{supervisor,{local,brick_mon_sup}},
           {started,
               [{pid,<0.2937.0>},
                {name,chmon_tab1_ch2},
                ...stuff omitted...

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:41 ===
Migration monitor: tab1: chains starting

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:41 ===
brick_admin: handle_cast: chain tab1_ch2 in unknown state

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:52 ===
Migration monitor: tab1: sweeps starting

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:54 ===
Migration number 1 finished

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:57 ===
Clearing final migration state for table tab1

For the sake of demonstration, now let’s see what brick_simple:chash_migration_pre_check() would say if we were to migrate from three chains to four chains.

(hibari_dev@bb3)25> Opts3 = proplists:get_value(brick_options, TabInfo3).

(hibari_dev@bb3)26> GH3 = proplists:get_value(ghash, TabInfo3).

(hibari_dev@bb3)28> OldFloatMap = brick_hash:chash_extract_new_float_map(GH3).

(hibari_dev@bb3)31> NewOpts4 = [PrefixMethod, NumSep, PrefixSep,
                                {new_chainweights, NewCWs4},
                                {old_float_map, OldFloatMap}].

(hibari_dev@bb3)35> NewCL4 = [{tab1_ch1, [{tab1_ch1_b1, hibari1@bb3}]},
                              {tab1_ch2, [{tab1_ch2_b1, hibari1@bb3}]},
                              {tab1_ch3, [{tab1_ch3_b1, hibari1@bb3}]},
                              {tab1_ch4, [{tab1_ch4_b1, hibari1@bb3}]}].

(hibari_dev@bb3)36> NewLH4 = brick_hash:chash_init(via_proplist, NewCL4, NewOpts4).

(hibari_dev@bb3)37> brick_simple:chash_migration_pre_check(tab1, NewLH4).
[{keys_before,[{tab1_ch1,349},
               {tab1_ch2,315},
               {tab1_ch3,337}]},
 {keys_keep,[{tab1_ch1,250},{tab1_ch2,232},{tab1_ch3,232}]},
 {keys_moving,[{tab1_ch4,287}]},
 {keys_moving_where,[{tab1_ch1,[{tab1_ch4,99}]},
                     {tab1_ch2,[{tab1_ch4,83}]},
                     {tab1_ch3,[{tab1_ch4,105}]}]},
 {errors,[]}]


The output tells us that chain tab1_ch1 will lose 99 keys, tab1_ch2 will lose 83 keys, and tab1_ch3 will lose 105 keys. The final key distribution across the four chains would be 250, 232, 232, and 287 keys, respectively.

Introduction

Caution

This document is under re-construction – beware!

The Problem

There exists a dichotomy in modern storage products. Commodity storage is inexpensive, but unreliable. Enterprise storage is expensive, but reliable. Large capacities are available in both the enterprise and commodity classes. The problem, then, becomes how to leverage inexpensive commodity hardware to achieve high-capacity, enterprise-class reliability at a fraction of the cost.

This problem space has been researched extensively, especially in the last few years: in academia, in the commercial sector, and by the open source community. Hibari uses techniques and algorithms from this research to create a solution that is reliable, cost effective, and scalable.

Key-Value Store

Hibari is a key-value store. If a key-value store were represented as an SQL table, it would be defined as:

[[sql-definition-key-value]]

SQL-like definition of a generic key value store
CREATE TABLE foo (
    BLOB key;
    BLOB value;
) PRIMARY KEY key;

In truth, each key stored in Hibari has three additional fields associated with it. See xref:hibari-data-model[] and link:hibari-contributor-guide.en.html[Hibari Contributor’s Guide] for details.
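With the native Erlang client, storing and retrieving one such key-value pair looks roughly like the sketch below; the table tab1 is assumed to already exist.

------------------------------------------------------------
%% Sketch: store and then fetch a single key-value pair.
brick_simple:set(tab1, <<"/foo/bar">>, <<"Hello, world!">>).
{ok, _Timestamp, <<"Hello, world!">>} = brick_simple:get(tab1, <<"/foo/bar">>).
------------------------------------------------------------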

[[hibari-origins]]

Hibari’s Origins

Hibari was originally written by Cloudian, Inc., formerly Gemini Mobile Technologies, to support mobile messaging and email services. Hibari was released outside of Cloudian under the Apache Public License version 2.0 in July 2010.

Hibari has been deployed by multiple telecom carriers in Asia and Europe. Hibari may lack some features such as monitoring, event and alarm management, and other "production environment" support services. Since each telecom operator has its own data center support infrastructure, Hibari's development has not included many services that would be redundant in a carrier environment.

We hope that Hibari’s release to the open source community will close those functional gaps as Hibari spreads outside of carrier data centers.

Summary of Hibari’s Main Features

  • A Hibari cluster is a distributed system.
  • A Hibari cluster is linearly scalable.
  • A Hibari cluster is highly available.
  • All updates are durable.
  • All updates are strongly consistent.
  • All client operations are lockless.
  • A Hibari cluster’s performance is excellent.
  • Multiple client access protocols are available.
  • Data is repaired automatically after a server failure.
  • Cluster configuration can be changed at any time.
  • Data is automatically rebalanced.
  • Heterogeneous hardware support is easy.
  • Micro-transactions simplify creation of robust client applications.
  • Per-table configurable performance options are available.

[[acid-base-hibari]]

The “ACID vs. BASE” Spectrum and Hibari

Important

We strongly believe that “ACID” and “BASE” properties exist on a spectrum and are not exclusively one or the other (black-or-white) properties.

Most database users and administrators are familiar with the acronym ACID: Atomic, Consistent, Isolated, and Durable. Now, consider an alternative method of storing and managing data, BASE:

  • Basically available
  • Soft state
  • Eventually consistent

For an link:http://queue.acm.org/detail.cfm?id=1394128[exploration of ACID and BASE properties (at ACM Queue)], see:

BASE: An Acid Alternative
Dan Pritchett
ACM Queue, volume 6, number 3 (May/June 2008)
ISSN: 1542-7730
http://queue.acm.org/detail.cfm?id=1394128

When both strict ACID and strict BASE properties are placed on a spectrum, they are at the opposite ends. However, a distributed database system can fit anywhere in the middle of the spectrum.

A Hibari cluster lies near the ACID end of the ACID/BASE spectrum. In general, Hibari’s design always favors consistency and durability of updates at the expense of 100% availability in all situations.

[[cap-theorem-and-hibari]]

The CAP Theorem and Hibari

Warning

Eric Brewer’s “CAP Theorem”, and its proof by Gilbert and Lynch, is a tricky thing. It’s nearly impossible to cleanly apply the purity of logic to the dirty world of real, industrial computing systems. We strongly suggest that the reader consider the CAP properties as a spectrum, one of balances and trade-offs. The distributed database world is not black and white, and it is important to know where the gray areas are.

See the link:http://en.wikipedia.org/wiki/CAP_theorem[Wikipedia article about the CAP theorem] for a summary of the theorem, its proof, and related links.

CAP Theorem (postulated by Eric Brewer, Inktomi, 2000)
Wikipedia
http://en.wikipedia.org/wiki/CAP_theorem

Hibari chooses the C and P of CAP. It utilizes the chain replication technique, and it always guarantees strong consistency. Hibari also includes an Erlang/OTP application specifically for detecting network partitions, so that when a network partition occurs, the brick nodes on the opposite side of the partition from the active master are removed from the chains to preserve the strong consistency guarantee.

See xref:admin-server-and-network-partition[] for details.

Hibari’s Main Features in Broad Detail

=== Distributed system

Multiple machines can participate in a single cluster. The maximum size of a Hibari cluster has not yet been determined. A practical limit of approximately 200-250 nodes is likely.

Any server node can handle any client request, forwarding a request to the correct server node when necessary. Clients maintain enough state to send their queries directly to the correct server node in all common cases.

=== Scalable system

The total storage and processing capacity of a Hibari cluster increases linearly as machines are added to the cluster.

=== Durable updates

Every key update is written and flushed to stable storage (via the fsync() system call) before sending acknowledgments to the client.

=== Consistent updates

After a key’s update is acknowledged, no client in the cluster can see an older version of that key. Hibari uses the “chain replication” algorithm to maintain consistency across all replicas of a key.

All data written to disk includes MD5 checksums; the checksums are validated on each read to avoid sending corrupted data to the client.

[[lockless-client-api]] === Lockless client API

The Hibari client API requires that all operations (read queries and/or update operations) be self-contained within a single client request. Therefore, locks are not implemented because they are not required.

Inside Hibari, each key-value pair also contains a ``timestamp’’ value. A timestamp is an integer. Each time the key is updated, the timestamp value must increase. (This requirement is enforced by all server nodes.)

In many database systems, if a client requires guarantees that a key has not changed since the last time it was read, then the client acquires a lock (or lease) on the key. In Hibari, the client’s update specifies the timestamp of the last read attempt of the key:

  • If the timestamp matches the server’s timestamp, the operation is permitted.
  • If the timestamp does not match the server’s timestamp, then the operation is not permitted, and the server’s current timestamp is returned to the client.
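
To make the rule concrete, here is a minimal Erlang sketch of the decision logic. It is purely illustrative: the module name, function name, and error term below are hypothetical and are not part of Hibari's actual client or server API.

.Sketch of the timestamp test-and-set decision rule (illustrative only)
-module(testset_sketch).
-export([check_testset/2]).

%% ServerTS is the timestamp currently stored with the key on the server;
%% ClientTS is the timestamp that the client observed at its last read.
check_testset(ServerTS, ClientTS) when ServerTS =:= ClientTS ->
    ok;                                 %% timestamps match: update permitted
check_testset(ServerTS, _ClientTS) ->
    {error, {current_ts, ServerTS}}.    %% mismatch: reject, return server's timestamp

For example, check_testset(12345, 12345) returns ok, while check_testset(12346, 12345) returns {error, {current_ts, 12346}}, telling the client which timestamp it must re-read before retrying.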

It is recommended that all Hibari nodes use NTP to synchronize their system clocks. The simplest Hibari client API uses timestamps based upon the OS system clock for timestamp values. This feature can be bypassed, however, by using a slightly more complex client API.

However, Hibari’s overload detection and work-dumping algorithms will use the OS system clock, regardless of which client API is used. All system clocks, client and server, should be synchronized to be within roughly 1 second of each other.

=== High availability

Each key can be replicated multiple times (configurable on a per-table basis). As long as one copy of the key survives, all operations on that key are permitted. A cluster can survive multiple cluster node failures and still maintain full data integrity.

The cluster membership application, called the Hibari Admin Server, runs as an active/standby application on one or more of the server nodes. The Admin Server’s configuration and private state are also maintained in Hibari server nodes. Shared storage such as NFS, shared SCSI/Fibre Channel LUNs, or replicated block devices are not required.

If the Admin Server fails and is restarted on a standby node, the rest of the cluster can continue normal operation. If another brick fails while the Admin Server is restarting, then clients may see service interruptions (usually in the form of timeouts) until the Admin Server has finished restarting and can react to the failure.

=== Multiple Client Protocols

Hibari supports many client protocols for queries and updates:

  • A native Erlang API, via Erlang’s native message-passing mechanism
  • Amazon S3 protocol, via HTTP
  • UBF, Joe Armstrong’s ``Universal Binary Format’’ protocol, via TCP
  • UBF via several minor variations of TCP transport
  • UBF over JSON-RPC, via HTTP
  • JSON-encoded UBF, via TCP

Protocols under development:

  • Memcached, via TCP
  • UBF over Thrift, via TCP
  • UBF over Protocol Buffers, via TCP

Most of the client access protocols are implemented using the Erlang/OTP application behavior. By separating each access protocol into separate OTP applications, Hibari’s packaging is quite flexible: packaging can add or remove protocol support as desired. Similarly, protocols can be stopped and started at runtime.

[[overview-high-performance]] === High performance

Hibari’s performance is competitive with other distributed, non-relational databases such as HBase and Cassandra, when used with similar replication and durability configurations. Despite the constraints of durable writes and strong consistency, Hibari’s performance can exceed those databases on some workloads.

IMPORTANT: The metadata of all keys stored by the brick, called the ``key catalog’‘, are stored in RAM to accelerate commonly-used operations. In addition, non-zero values of the “expiration_time” and non-empty values of “flags” are also stored in RAM (see xref:sql-definition-hibari[]). As a consequence, a multi-million key brick can require many gigabytes of RAM.

=== Automatic repair

Replicas of keys are automatically repaired whenever a cluster node crashes and restarts.

=== Dynamic configuration

The number of replicas per key can be changed without service interruption. Likewise, replication chains can be added or removed from the cluster without service interruption. This permits the cluster to grow (or shrink) as workloads change. See xref:chain-migration[] for more details.

=== Data rebalancing

Keys will be automatically be rebalanced across the cluster without service interruption. See xref:chain-migration[] for more details.

=== Heterogeneous hardware support

Each replication chain can be assigned a weighting factor that will increase or decrease the percentage of a table’s key space relative to all other chains. This feature can permit use of cluster nodes with different CPU, RAM, and/or disk capacities.

=== Micro-Transactions

Under limited circumstances, operations on multiple keys can be given transactional commit/abort semantics. Such micro-transactions can considerably simplify the creation of robust applications that keep data consistent despite failures by both clients and servers.

[[per-table-config-perf-options]] === Per-table configurable performance options

Each Hibari table may be configured with the following options to enhance performance ... though each of these options has a corresponding price to pay.

  • RAM-based storage: All data (both keys and values) may be stored in RAM, at the expense of increased RAM consumption. Disk is still used to log all updates, to protect against a catastrophic power failure.
  • Asynchronous writes: Use of the fsync() system call can be disabled to improve performance, at the expense of data loss in a system crash or power failure.
  • Non-durable updates: All update logging can be disabled to improve performance, at the expense of data loss when all nodes in a replication chain crash.

Building A Hibari Database

=== Defining a Schema

Hibari is a key-value database. Unlike a relational DBMS, Hibari applications do not need to create a schema. The only application requirement is that all of its tables be created in advance; see xref:creating-new-tables[] below.

[[hibari-data-model]] === The Hibari Data Model

If a Hibari table were represented within an SQL database, it would look something like this:

[[sql-definition-hibari]]
.SQL-like definition of a Hibari table
include::texts-src/hibari-sql-definition.txt[]

Hibari table names use the Erlang data type ``atom’‘. The types of all key-related attributes are presented below.

.Types of Hibari table key-value attributes
include::texts-src/hibari-key-value-attrs.txt[]

include::texts-src/hibari-key-value-attrs-expl.txt[]

The practical constraints on maximum value blob size are affected by total blob size and by the frequency of large blob access. For example, storing an occasional 64MB value blob is different than a 100% write workload of 100% 64MB value blobs. The Hibari client API does not have a method to update or fetch less than the entire value blob, so a brick can be blocked for many seconds if it tries to operate on (for example) even a single 4GB blob. In addition, other processes can be blocked by ‘busy_dist_port’ events while processing big value blobs.

=== Hibari’s Client Operations

Hibari’s basic client operations are enumerated below.

add:: Set a key/value/expiration/flags only if the key does not already exist.
delete:: Delete a key.
get:: Get a key’s timestamp and value.
get_many:: Get a range of keys.
replace:: Set a key/value/expiration/flags only if the key does exist.
set:: Set a key/value/expiration/flags.
txn:: Start of a micro-transaction.

Each operation can be accompanied by operation-specific flags. Some of these flags include:

witness:: Do not return the value blob. (get, get_many)
must_exist:: Abort micro-transaction if key does not exist.
must_not_exist:: Abort micro-transaction if key does exist.
{testset, TS}:: Perform the action only if the key’s current timestamp exactly matches TS. (delete, replace, set, micro-transaction)
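
As an informal illustration (using the same pseudo-notation as the micro-transaction examples later in this document, not literal client API syntax), a read that should return only a key's metadata could attach the witness flag to a get operation:

{op = get, key = "string1", flags = [witness]}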

For details of these operations and lesser-used per-operation flags, see:

  • xref:micro-transactions[]
  • link:hibari-contributor-guide.en.html[Hibari Contributor’s Guide]

=== Indexes

Hibari does not support automatic indexing of value blobs. If an application requires indexing, the application must build and maintain those indexes.

[[creating-new-tables]] === Creating New Tables

New tables can be created by two different methods:

  • Via the Admin Server’s status server. Follow the “Add a table” link at the bottom.
  • Using the Erlang shell.

For details on the Erlang shell API and detailed explanations of the table options presented in the Admin server’s HTTP interface, see the link:hibari-contributor-guide.en.html[Hibari Contributor’s Guide]

Hibari Architecture

From a logical point of view, Hibari’s architecture has three layers:

  • Top layer: consistent hashing
  • Middle layer: chain replication
  • Bottom layer: the storage brick

This section discusses each of these major layers in detail, starting from the bottom and working upward.

.Logical architecture diagram; physical hosts/bricks are color-coded with 5 colors
svgimage::images/logical-architecture1[align="center", scaledwidth="80%"]

.Logical architecture diagram, alternative perspective
svgimage::images/logical-architecture-alt[align="center", scaledwidth="80%"]

Bricks, Physical and Logical

The word “brick” has two different meanings in a Hibari system:

  • An entire physical machine that has Hibari software installed, configured, and (hopefully) running.
  • A logical software entity that runs inside the Hibari application that is responsible for managing key-value pairs.

[[the-physical-brick]]

The physical brick

The phrases “physical brick” and “machine” are interchangeable, most of the time. Hibari is designed to react correctly to the failure of any part of the machine that the Hibari application is running on:

  • disk
  • power supply
  • CPU
  • network

Hibari is designed to take advantage of low-cost, off-the-shelf commodity servers.

A physical brick is the basic unit of failure. Data replication (via the chain replication algorithm) is responsible for protecting data, not redundant equipment such as dual power supplies and RAID disk subsystems. If a physical brick crashes for any reason, copies of data on other physical bricks can still be used.

It is certainly possible to decrease the chances of data loss by using physical bricks with more expensive equipment. Given the same number of copies of a key-value pair, the chances of data loss are less if each brick has multiple power supplies and RAID 1/5/6/10 disk. But risk of data loss can also be reduced by increasing the number of data replicas (“chain length”) using cheaper, non-redundant server hardware.

The logical brick

A logical brick is a software entity that runs within a Hibari application instance on a physical brick. A single Hibari physical brick can support dozens or (potentially) hundreds of logical bricks, though limitations of CPU, RAM, and/or disk capacity can impose a smaller limit.

A logical brick maintains RAM and disk data structures to store a collection of key-value pairs. The keys are maintained in lexicographic sorting order.

The replication technique used by Hibari, chain replication, maintains identical copies of key-value pairs across multiple logical bricks. The number of copies of a key-value pair is exactly equal to the length of the chain. See the next subsection below for more details.

It is possible to configure Hibari to place all of the logical bricks for the same chain onto the same physical brick. This practice can be useful in a developer’s environment, but it is impractical for production networks: such a configuration does not have any physical redundancy, and therefore it poses a greater risk of data loss.

[[write-ahead-logs]]

Write-Ahead Logs

By default, all logical bricks will record all updates to a write-ahead log. Used by many database systems, a write-ahead log (WAL) appears to be an infinitely-sized log where all important events (e.g. all write and delete operations) are appended to the end of the log. The log is considered write-ahead if a log entry is written prior to any significant processing by the application.

[[write-ahead-logs-in-hibari]]

Write-ahead logs in the Hibari application

Two types of write-ahead logs are used by the Hibari application. These logs cooperate with each other to provide several benefits to the logical brick.

There are two types of write-ahead logs:

  • The shared common log. This single write-ahead log instance provides durability guarantees to all logical bricks within the server node via the fsync() system call.
  • Individual private logs. Each logical brick maintains its own private write-ahead log instance. All metadata regarding keys in the logical brick are stored in the logical brick’s private log.

All updates are written first to the common log, usually in a synchronous manner. At a later time, update metadata is lazily copied from the common log to the corresponding brick’s private log. Value blobs (for bricks that store value blobs on disk) will remain in the common log and are managed by the scavenger, see xref:scavenger[].

svgimage::images/private-and-common-logs[align=”center”, scaledwidth=”80%”]

[[two-wal-types]]

Two types of write-ahead logs

The two log types cooperate to support a number of useful properties.

  • Data durability in case of system crash or power failure. All synchronous writes to the ``common log’’ are guaranteed to be flushed to stable storage.
  • Performance enhancement by limiting fsync() usage. After a logical brick writes data to the common log, it will request an fsync(). The common log will combine fsync() requests from multiple bricks into a single system call.
  • Performance enhancement at logical brick startup. A brick’s private log stores only that brick’s key metadata. Therefore, at startup time, the logical brick does not scan data maintained by other logical bricks. This can be a very substantial time savings as the amount of metadata managed by all logical bricks grows over time.
  • Performance enhancement by separating synchronous writes from asynchronous writes. If the common log’s storage is on a separate device, e.g. a write-optimized flash memory block device, then all of the fsync() calls can finish much faster. The later asynchronous/lazy copying of key metadata from the common log to the individual private logs can then take advantage of OS dirty page write coalescing and other I/O optimizations without interference from fsync(). These copies are performed roughly once per second.

[[wal-dirs-and-files]]

Directories and files used by write-ahead logs

Each write-ahead log is stored on disk as a collection of large files (default = 100MB each). Each file in the log is identified by a log sequence number and is called a log sequence file.

Log sequence files are append-only and are never written again. Consequently, data in a log sequence file is never overwritten. Disk space is reclaimed by checkpoint and scavenger operations, which copy live data from old log sequence files and append it to new log sequence files. Once the new log sequence file(s) have been flushed to stable storage, the old log sequence file(s) can be deleted.

When a log sequence file reaches its maximum size, the current log file is closed and a new one is opened with a monotonically increasing log sequence number.

All log files for a write-ahead log are grouped under a single directory called hlog.{log-name}, where {log-name} is the name of the brick or of the common log. These directories are stored under the var/data subdirectory of the application’s installation path, /usr/local/TODO/TODO/var/data (by default).

The maximum log file size (brick_max_log_size_mb in the central.conf file) is advisory only and is not enforced as a hard limit.

Reclaiming disk space used by write-ahead logs

In practice, infinite storage is not yet available. The Hibari system uses two mechanisms to reclaim unused disk space:

  • The checkpoint mechanism, see xref:checkpoints[].
  • The scavenger mechanism, see xref:scavenger[].

Write-ahead log serial numbers

Each item written in a write-ahead log is assigned a serial number. If the brick is in the standalone or head role, then the serial number will be assigned by that brick. For downstream bricks, the serial number assigned by the head brick will be used.

The serial number mechanism is used to ensure that a single unique ordering of log items will be written to each brick log. In certain failure cases, log items may be re-sent down the chain a second time, see xref:failure-middle-brick[].

// JWN: Does the above mechanism “to ensure that a single unique ordering” // applies to both common log and private log?

[[chains]] === Chains

A chain is the unit of data replication used by the link:http://www.usenix.org/events/osdi04/tech/renesse.html[``chain replication’’ technique as described in this paper]:

Chain Replication for Supporting High Throughput and Availability
Robbert van Renesse and Fred B. Schneider
USENIX OSDI 2004 conference proceedings
http://www.usenix.org/events/osdi04/tech/renesse.html

Data replication algorithms can be separated into two basic families:

  • State machine replication
  • Quorum replication

The chain replication algorithm is from the state machine family of replication algorithms. It is a variation of the familiar ``master/slave’’ replication algorithm, where all updates are sent to a master node and then copies are sent to zero or more slave nodes.

Chain replication requires a very specific ordering of nodes (which store copies of data) and the messages passed between them. The diagram below depicts the “key update” message flow in a chain of length three.

[[diagram-write-path-3]]
.Message flow in a chain for a key update
svgimage::images/write-path-3[align="center", scaledwidth="80%"]

If a chain is of length one, then the same brick assumes both ``head’’ and ``tail’’ roles simultaneously. In this case, the brick is called a ``standalone’’ brick.

.Message flow for a key update to a chain of length 1
svgimage::images/write-path-1[align="center", scaledwidth="30%"]

To maintain the strong consistency property, a client must read data from the tail brick in the chain. A read processed by any other chain member would permit the client to see an update that has not yet been processed by all bricks and therefore could result in a strong consistency violation. Such a violation is frequently called a ``dirty read’’ in other database systems.

.Message flow for a read-only key query
svgimage::images/read-path-3[align="center", scaledwidth="80%"]

[[bricks-outside-chain-replication]] ==== Bricks outside of chain replication

During Hibari’s development, we encountered a problem of managing the state required by the Admin Server. If data managed by chain replication requires the Admin Server to be running, how can the Admin Server read its own data? There is a ``chicken and the egg’’ dependency problem that must be solved.

// JWN: Why wasn’t Mnesia used for the Admin Server’s storage // implementation?

The solution is simple: do not use chain replication to manage the Admin Server’s data. Instead, that data is replicated using a simple ``quorum replication’’ technique. These bricks all have names starting with the string “bootstrap”.

A brick must be in ``standalone’’ mode to answer queries when it is used outside of chain replication. See xref:brick-roles[] for details on the standalone role.

=== Tables

A table divides the key namespace within Hibari. If you need two different keys called “foo” that have different values, you store each “foo” key in a separate table. The same is true in other database systems.

Hibari’s implementation uses one or more replication chains to store the data for one table.

.Relationship between tables, chains, and bricks.
svgimage::images/table-chain-brick[align="center", scaledwidth="70%"]

[[micro-transactions]] === Micro-Transactions

In a single request, a Hibari client may send multiple update operations to the cluster. The client has the option of requesting ``micro-transaction’’ semantics for those updates: if there are no errors, then all updates will be applied atomically. This behaves like the ``transaction commit’’ behavior supported by most relational databases.

On the other hand, if there is an error while processing one of the update operations, then all of update operations will fail. This behaves like the ``transaction abort’’ behavior supported by most relational databases.

Unlike most relational databases, Hibari does not have a transaction manager that can coordinate ACID semantics for arbitrary read and write operations across any row in any table. In fact, Hibari has no transaction manager at all. For this reason, Hibari calls its limited transaction feature ``micro-transactions’‘, to distinguish this feature from other database systems.

Hibari’s micro-transaction support has two important limitations:

  • All keys involved in the transaction must be stored in the same replication chain (and therefore by the same brick(s)).
  • Operations within the micro-transaction cannot see updates by other operations within the same micro-transaction.

[id="footab-example"]
.Four keys in the "footab" table, distributed across two chains of length three.
svgimage::images/micro-transaction-example[align="center", scaledwidth="70%"]

In the diagram above, a micro-transaction can be permitted if it operates on only the keys “string1” & “string4” or only the keys “string2” and “string3”. If a client were to send a micro-transaction that operates on keys “string1” and “string3”, the micro-transaction will be rejected: key “string3” is not stored by the same chain as the key “string1”.

[id="valid-utxn"]
.Valid micro-transaction: all keys managed by the same chain

[txn,
 {op = replace, key = "string1", value = "Hello, world!"},
 {op = delete, key = "string4"}
]

[id="invalid-utxn"]
.Invalid micro-transaction: keys managed by different chains

[txn,
 {op = replace, key = "string1", value = "Hello, world!"},
 {op = delete, key = "string2"}
]

The client does not have direct control over how keys are distributed across chains. When a table is defined and created, its configuration specifies the algorithm used to map a {TableName, Key} pair to a specific chain.

// JWN: This might be a good place to briefly explain the benefits of // using a key prefix and how it is beneficial to (some) applications.

NOTE: See link:hibari-contributor-guide.en.html#add-a-new-table[Hibari Contributor’s Guide, “Add a New Table” section] for more information about table configuration.

=== Distribution: Workload Partitioning and Fault Tolerance

[[consistent-hashing-example]] ==== Partitioning by consistent hashing

To spread computation and storage workloads across all servers in the cluster, Hibari uses a technique called ``consistent hashing’‘. This hashing technique attempts to distribute a table’s key space evenly across all chains used by that table.

IMPORTANT: The word ``consistent’’ has slightly different meanings relative to ``consistent hashing’’ and ``strong consistency’‘. The consistent hashing algorithm is a commonly-used algorithm for key -> storage location calculations. Consistent hashing does not affect the ``eventual consistency’’ or ``strong consistency’’ semantics of a database system.

See the xref:footab-example[] for an example of a table with two chains.

See link:hibari-contributor-guide.en.html#add-a-new-table[Hibari Contributor’s Guide, “Add a New Table” section] for details on valid options when creating new tables.

===== Consistent hashing algorithm

Hibari uses the following steps in its consistent hashing algorithm implementation:

  • Calculate the ``hashing prefix’’, using part or all of the key as input to the next step.
** This step is configurable, using built-in functions or by providing a custom implementation function.
** Built-in prefix functions:
*** Null: use the entire key.
*** Fixed length: e.g. a 4 byte or 8 byte constant-length prefix.
*** Variable length: use a separator character '/' (configurable), such that the hash prefix is found between the first two (also configurable) '/' characters. E.g. if the key is /user/bar, then the string /user/ is used as the hash prefix.
  • Calculate the MD5 checksum of the hashing prefix and then convert the result to the unit interval, 0.0 - 1.0, using floating point arithmetic.
  • Consult the unit interval -> chain map to calculate the chain name.
** This map contains a tree of {StartValue, EndValue, ChainName} tuples. For example, {0.0, 0.5, footab_ch1} will map the interval (0.0, 0.5] to the chain named footab_ch1.
** The mapping tree’s construction is affected by the chain weighting factor. The weighting factor allows some chains to store more keys than other chains.
  • Use the operation type to calculate the brick name.
** For read-only operations, choose the tail brick.
** For update operations, choose the head brick.
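
The steps above can be condensed into a small, self-contained Erlang sketch. It assumes the variable-length built-in prefix function with '/' as the separator; the module and function names are illustrative only and are not Hibari's internal API.

.Sketch of the consistent hashing calculation (illustrative only)
-module(chash_sketch).
-export([hash_prefix/1, key_to_chain/2]).

%% Variable-length prefix: the hash prefix lies between the first two '/'
%% characters, e.g. hash_prefix(<<"/user/bar">>) yields <<"/user/">>.
hash_prefix(<<$/, Rest/binary>> = Key) ->
    case binary:split(Rest, <<"/">>) of
        [First, _] -> <<$/, First/binary, $/>>;
        _          -> Key               %% no second '/': use the entire key
    end;
hash_prefix(Key) ->
    Key.                                %% no leading '/': use the entire key

%% Hash the prefix onto the unit interval, then look the result up in a
%% [{StartValue, EndValue, ChainName}] interval map.
key_to_chain(Key, ChainMap) ->
    MD5  = erlang:md5(hash_prefix(Key)),
    Unit = binary:decode_unsigned(MD5) / math:pow(2, 128),
    [Chain | _] = [C || {S, E, C} <- ChainMap, Unit > S, Unit =< E],
    Chain.

For example, key_to_chain(<<"/user/bar">>, [{0.0, 0.5, footab_ch1}, {0.5, 1.0, footab_ch2}]) hashes the prefix <<"/user/">> and returns the chain that owns the matching portion of the unit interval.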

===== Consistent hashing algorithm use within the cluster

  • Hibari clients use the algorithm to calculate which chain must handle operations for a key. Clients obtain this information via updates from the Hibari Admin Server. These updates allow the client to send its request directly to the correct server in most use cases.
  • Servers use the algorithm to verify that the client’s calculation was correct.
** If a client sends an operation to the wrong brick, the brick will forward the operation to the correct brick.
** If a client sends a list of operations such that some keys are stored on the brick and other keys are not, an error is returned to the client. Micro-transactions are not supported across chains.

===== Changing consistent hashing configuration dynamically

Hibari’s Admin Server will allow changes to the consistent hashing algorithm without service interruption. Such changes are applied on a per-table basis:

  • Adding or removing chains to the unit interval -> chain map.
  • Modifications of the chain weighting factor.
  • Modifying the key -> hashing prefix calculation function.

See the xref:chain-migration[] section for more information.

==== Multiple replicas for fault tolerance

For fault tolerance, data replication is required. As explained in xref:chains[], the basic unit of failure is the brick. The chain replication algorithm will maintain replicas of keys in a strongly consistent manner across all bricks: head, middle, and tail bricks.

To be able to tolerate F failures without data loss or service interruption, each replication chain must be at least F+1 bricks long. This is in contrast to quorum replication family algorithms, which typically require 2F+1 replica bricks.

// JWN: Would it be helpful to put a note that typically “3” is the // recommended number of replicas?

===== Changing chain length configuration dynamically

Hibari’s Admin Server will allow changes to a chain’s length without service interruption. Such changes are applied on a per-chain basis. See the xref:chain-length-change[] section for more information.

[[admin-server-app]]
The Admin Server Application

The Hibari ``Admin Server’’ is an OTP application that runs in an active/standby configuration within a Hibari cluster. The Admin Server is responsible for:

  • Monitoring the health of each brick in the cluster, see xref:brick-lifecycle-fsm[].
  • Monitoring the status of each chain in the cluster, see xref:chain-lifecycle-fsm[].
  • Managing administrative changes of chain -> brick mappings, see xref:chain-length-change[].
  • Managing data rebalancing, see xref:chain-migration[].
  • Communicating cluster status to Hibari client nodes.
  • Other administrative tasks, such as the creation of new tables.

Only one instance of the Admin Server is permitted to run within the cluster at a time. The Admin Server runs in an ``active/standby’’ configuration that is used in many high-availability clustered applications. The nodes that are eligible to participate in the active/standby configuration are configured via the main Hibari configuration file; see xref:admin-server-in-central-conf[] and xref:central-conf-parameters[] for more details.

=== Admin Server Active/Standby Implementation

The active/standby application failover is handled by the Erlang/OTP application controller. No extra third-party software is required. See Chapter 7, “Applications”, and Chapter 9, “Distributed Applications”, in the “OTP Design Principles User’s Guide” at http://www.erlang.org/doc/design_principles/distributed_applications.html.
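
For readers unfamiliar with that OTP mechanism, the fragment below shows the generic shape of a distributed-application entry in a sys.config file. The application and node names are hypothetical placeholders, not Hibari's actual names; consult the OTP documentation and Hibari's packaging for the real settings.

.Generic OTP distributed application configuration (hypothetical names)
[{kernel,
  [{distributed,
    [{my_admin_app, 5000, ['hibari1@host-a', {'hibari2@host-b', 'hibari3@host-c'}]}]},
   {sync_nodes_optional, ['hibari2@host-b', 'hibari3@host-c']},
   {sync_nodes_timeout, 10000}]}].

With such a configuration, the application runs on 'hibari1@host-a' while that node is up; if the node goes down, the application controller restarts the application on one of the standby nodes after the 5000 millisecond takeover timeout.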

[[bootstrap-bricks]] === Admin Server’s Private State: the Bootstrap Bricks

On each active and standby node, there is a hint file called Schema.local which contains the name of the ``bootstrap bricks’‘. These bricks operate outside of the chain replication algorithm to provide redundant, persistent state for the Admin Server application. See xref:bricks-outside-chain-replication[] for a short summary of standalone bricks.

All of the Admin Server’s private state is stored in the bootstrap bricks. This includes:

  • All table definitions and their configuration, e.g. consistent hashing parameters.
  • Status of all bricks and all chains.
  • Operational history of all bricks and all chains.

With the help of the Erlang/OTP application controller and the Hibari Partition Detector application, only a single instance of the Admin Server is permitted to run at any one time. That single application instance has full control over the data stored in the bootstrap bricks and therefore does not have to manage concurrent updates to bootstrap brick data.

=== Admin Server Crash and Restart

When the Admin Server application is stopped (e.g. node shutdown) or crashes (e.g. software bug, power failure), all of the tasks outlined at the beginning of xref:admin-server-app[] are halted. In theory, the 20-30 seconds that are required for the Admin Server to restart could mean 20-30 seconds of negative service impact to Hibari clients.

In practice, however, Hibari clients almost never notice when an Admin Server instance crashes and restarts. Hibari clients do not need the Admin Server when the cluster is stable. The Admin Server is only necessary when the state of the cluster changes. Furthermore, as far as clients are concerned, clients are only affected when bricks crash. Other cluster change events, such as when chain replication repair finished, do not directly impact clients and thus can wait for the Admin Server to finish restarting.

A Hibari client will only notice an Admin Server crash if another logical brick crashes while the Admin Server is temporarily out of service. The reason lies in the nature of the Admin Server’s responsibilities. When a chain is broken by a brick failure, the remaining bricks must have their roles reconfigured to put the chain back into full service. The Admin Server is the only automated entity that is permitted to change the role of a brick. For more details, see:

  • xref:brick-lifecycle-fsm[]
  • xref:chain-lifecycle-fsm[], and
  • xref:chain-repair[].

[[admin-server-and-network-partition]] === Admin Server and Network Partition

One limitation of the Erlang/OTP application controller is that it is not robust in the event of a network partition. To prevent multiple Admin Server apps from running simultaneously, another application is bundled with Hibari: the Partition Detector. See xref:partition-detector[] for an overview and an explanation of the ‘A’ and ‘B’ physical networks.

As described briefly in xref:cap-theorem-and-hibari[], Hibari does support the “Partition tolerance” aspect of Eric Brewer’s CAP theorem. More specifically, if a network partition occurs, and a Hibari cluster is split into two or more pieces, not all clients on both/all sides of the network partition will be able to access Hibari services.

For the sake of discussion, we assume the cluster has been split into two fragments by a single partition, though any number of fragments may happen in real use. We also assume that nodes on both sides of the partition are configured in standby roles for the Admin Server.

If a network partition event happens, the following events will soon follow:

  • The OTP application controller for some/all central.conf-configured nodes will notice that communication with the formerly active Admin Server is now impossible.
  • Using internal logic, each application controller will make a decision of which standby node should move to active status.
  • Each active status node will start an instance of the Admin Server.

Note that all steps above will happen in parallel on nodes on both sides of the partition. If this situation is permitted to continue, the invariant of “Admin Server may only run on one node at a time” will be violated. However, with the help of the Partition Detector application, multiple Admin Server instances can be detected and halted.

UDP broadcasts on the ‘A’ and ‘B’ networks can help the Admin Server determine if it was restarted due to an Admin Server crash or by a network partition. In case of a network partition on network ‘A’, the broadcasts on network ‘B’ can indicate that another Admin Server process remains alive.

If multiple Admin Server instances are detected, the following logic is used:

  • If an Admin Server is in its “running” phase, then any other Admin Server instance that is still in its “initialization” phase will halt.
  • If multiple Admin Server instances are all in the “initialization” phase, then only the Admin Server instance with the smallest name (in lexicographic sorting order) is permitted to run: all other instances will halt.

==== Importance of two physically separate networks

IMPORTANT: It is possible for both the ‘A’ and ‘B’ networks to partition simultaneously. The Admin Server and Partition Detector applications cannot always correctly react to such events. It is extremely important that the ‘A’ and ‘B’ networks be separate physical networks: separate physical network interfaces on each brick, separate cabling, separate network switches, and all other network-related equipment must also be physically separate.

It is possible to reduce the reliance on multiple physical networks and the Partition Detector application, but such techniques have not been added to Hibari yet. Until an alternative network partition mitigation mechanism is implemented, we strongly recommend the proper configuration of the Partition Detector app and all of its hardware requirements.

=== Admin Server, Network Partition, and Client Access

When a network partition event occurs, there are two cases that affect a client’s ability to work with the cluster.

  • The client machine is on the same side of the partition as the Admin Server.
  • The client machine is on the opposite side of the partition as the Admin Server.

If the client machine is on the same side of the partition, the client may see no interruption of service at all. If the Admin Server is restarted in reaction to the partition event, there may be a small window of time (e.g. 20-30 seconds) where requests might fail because the Admin Server has not yet reconfigured chains on this side of the partition.

If the client machine is on the opposite side of the partition, then the client will not have access to the Admin Server and may not have access to properly configured chains. If a chain lies entirely on the same side of the partition as the client, then the client can continue to use that chain successfully. However, any chain that is “cut in two” by the partition cannot support updates by any client.

Hibari System Information: Configuration Files, Etc.

Hibari’s system information is stored in one of two places. The first is the application configuration file, central.conf. By default, this file is stored in TODO/{version number}/etc/central.conf.

The second location is within Hibari server nodes themselves. This kind of configuration, stored inside the “bootstrap” bricks, makes it easy to share data with all nodes in the cluster.

Many of the configuration values in central.conf will be the same on all nodes in a Hibari cluster. Given this reality, why not store those items in Hibari itself? The biggest problem arises when the application is first starting. See xref:bricks-outside-chain-replication[] for an overview of why it isn’t easy to store all configuration data inside Hibari itself.

In the future, it’s likely that many of the configuration items in the central.conf file will move to storage within Hibari itself.

=== central.conf File Syntax and Usage

Each line of the central.conf file has the form

parameter: value

where parameter is the name of the configuration option being set and value is the value that the configuration option is being set to.

Valid data types for configuration settings are INT (integer), STRING (string), and ATOM (one of a pre-defined set of option names, such as on or off). Apart from data type restrictions, no further valid range restrictions are enforced for central.conf parameters.

All time values in central.conf (such as delivery retry intervals or transaction timeouts) must be set as a number of seconds.

Blank lines and lines beginning with the pound sign (#) are ignored.
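
As a simple illustration of this syntax, a central.conf fragment might look like the following. The parameter names appear elsewhere in this document; the values shown are examples only, not defaults or recommendations.

.Illustrative central.conf fragment (example values only)
# Write-ahead log and checkpoint tuning
brick_max_log_size_mb: 100
brick_check_checkpoint_max_mb: 5
brick_check_checkpoint_throttle_bytes: 10000000

# Scavenger tuning
brick_skip_live_percentage_greater_than: 90
brick_scavenger_temp_dir: /tmp/hibari-scavenger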

IMPORTANT: To apply changes that you have made to the central.conf file, you must restart the server. There are exceptions to this rule, but one of the remaining cleanup/janitor tasks is to ensure that central.conf is always accessed through a standard set of APIs, e.g. the gmt_config_svr API.

[[central-conf-parameters]] === Parameters in the central.conf File

A detailed explanation of each of the items in central.conf can be found at link:../misc-files/central-conf.pdf[Hibari central.conf Configuration Guide].

=== Admin Server Configuration

Configuration for the Hibari ``Admin Server’’ is stored in three places:

. The central.conf file
. The Schema.local file
. Inside the ``bootstrap’’ bricks

[[admin-server-in-central-conf]] ==== Admin Server entries in the central.conf file

The following entries in the central.conf file are used by the Hibari Admin Server:

  • admin_server_distributed_nodes
** This option specifies which nodes in the Hibari cluster are
eligible to run the Admin Server. Hibari server nodes not included in this list cannot run the Admin Server.
** Active/standby service is provided by the Erlang/OTP platform’s
application management facility.
  • The Schema.local file
** This file provides a list of {logical brick, Hibari server node name}
tuples that store the Admin Server’s private state. Each brick name in this list starts with the prefix bootstrap_copy followed by an integer.
  • The ``bootstrap’’ bricks
** Each of these bricks store an independent copy of all Hibari
cluster state: table definitions, table -> chain mappings, start & stop history, etc.
** Data in each of the bootstrap bricks is not maintained by chain
replication. Rather, quorum-style replication is used. See xref:bricks-outside-chain-replication[].

=== Configuration Not Stored in Editable Config Files

All table and chain configuration parameters are stored within the Admin Server’s ``schema’‘. The schema contains information on:

  • Table names and options (e.g. blob values stored in RAM or on disk, sync/async disk logging)
  • Table -> chain mappings
  • Chain -> brick mappings

Much of this information can be seen in HTML form by pointing a Web browser at TCP port 23080 (default) of any Hibari server node. For example:

.Admin Server Top-Level Status & Admin URL
http://hibari-server-node-hostname:23080/

Your Web browser should be redirected automatically to the Admin Server’s top-level status & admin page.

NOTE: The APIs that expose this are, for the most part, already written. We need more “friendly” wrapper funcs as part of the “try this first” set of APIs for administration.

The Life of a (Logical) Brick

All logical bricks within a Hibari cluster go through the same set of lifecycle events. Each is described in greater detail in this section.

  • Brick initialization and operation states, described by a finite state machine.
  • Brick roles within chain replication, also described by a finite state machine.
  • Periodic housekeeping tasks performed by logical bricks and their internal support services, e.g. checkpoints and the ``scavenger’‘.

[[brick-lifecycle-fsm]] === Brick Lifecycle Finite State Machine

The lifecycle of each Hibari logical brick goes through a set of states defined by a finite state machine (OTP gen_fsm behavior) that is executed by a process within the Admin Server application.

.Logical brick lifecycle finite state machine
svgimage::images/brick-fsm[align="center"]

.Logical brick lifecycle FSM states

unknown;;
This is the initial state of the FSM. Because the Admin Server may crash or be restarted at any time, this state is used by the Admin Server when it has not been running long enough to determine the state of the logical brick.
pre_init;;
A brick moves itself to this state when it has finished scanning its private write-ahead log (see xref:write-ahead-logs[]) and therefore knows the state of all keys that it manages.
repairing;;
In chain replication, the repairing state is used to synchronize a newly started/restarted brick with the rest of the chain. At the end of this state, the brick is 100% in sync with all other active members of the chain. Repair is initiated by the Admin Server’s chain monitor that is responsible for the chain.
ok;;
The brick moves itself to this state when repair has finished. The brick is now in service and capable of servicing Hibari client requests. Client requests will be rejected if the brick is in any other state.
  • If managed by chain replication, this brick is eligible to be put into service as a full member of a replication chain. See xref:brick-roles[].
  • If managed by quorum replication, some external entity must change the logical brick’s state from pre_init -> ok. Hibari’s Admin Server automates this task for the `bootstrap_copy`* bricks. The present implementation of the Admin Server does not manage quorum replication bricks outside of the Admin Server’s private use.
disk_error;;
A disk error has occurred, for example a missing file or directory or MD5 checksum error. Administrator intervention is required to move a brick out of the disk_error state: shut down the entire Hibari server, kill the logical brick manually, or use the brick_chainmon:force_best_first_brick() function manually.

[[chain-lifecycle-fsm]] === Chain Lifecycle Finite State Machine

The chain FSM (OTP gen_fsm behavior) is executed by a process within the Admin Server application. All state transitions are triggered by changes in the state of each member brick, into or out of the ‘ok’ state. See xref:brick-lifecycle-fsm[] for details.

.Chain replication finite state machine
svgimage::images/chain-fsm[align="center"]

.Chain lifecycle FSM states

unknown;;
The state of the chain is unknown. Information regarding chain members is unavailable. Because the Admin Server may crash or be restarted at any time, this state is used by the Admin Server when it has not been running long enough to determine the state of the chain. It is possible that the chain was in degraded or healthy state before the crash and therefore Hibari client operations may be serviced while in this state.
unknown_timeout;;
This intermediate state is used by the Admin Server before moving automatically to another state.
stopped;;
All bricks in the chain are crashed or believed to have crashed. Service to Hibari clients will be interrupted.
degraded;;
Some (but not all) bricks in the chain are in service. The Admin Server will wait for another chain member to enter its pre_init state before chain repair can start.
healthy;;
All bricks in the chain are in service.

[[brick-roles]] === Brick ``Roles’’ Within A Chain

Each brick within a chain has a role. The role will be changed by the Admin Server whenever it detects that the chain’s state has changed. These roles are:

head;;
The brick is first in the chain, i.e. at the ``head’’ of the chain’s ordered list of bricks.
tail;;
The brick is last in the chain, i.e. at the ``tail’’ of the chain’s ordered list of bricks.
middle;;
The brick is neither the ``head’’ nor ``tail’’ of the chain. Instead, the brick is somewhere in the middle of the chain.
standalone;;
In a chain of length 1, the ``standalone’’ brick is a brick that acts both as a ``head’’ and ``tail’’ brick simultaneously.

There is one additional attribute that is given to one brick in each chain. Its name is ``official tail’’.

official tail;;
The official tail brick has two duties for the chain:
  • It handles read-only queries to the chain.
  • It sends replies to the client for all update operations that are sent to the head of the chain.

In a healthy chain, the official tail is the same brick as the tail of the chain. While a new brick is being repaired at the end of the chain, the official tail is the last fully in-sync (``ok’’ state) brick in the chain. Hibari clients are not aware of “tail” bricks that are undergoing repair. Any client request that is sent to a brick in the repairing state will be rejected.

See xref:diagram-write-path-3[] for an example of a healthy chain of length three.
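
The mapping from chain position to role can be summarized with a short illustrative Erlang function. This is a sketch for explanation only; in a running cluster the Admin Server assigns roles, and clients never compute them.

.Sketch: deriving brick roles from a healthy chain's member list (illustrative only)
-module(role_sketch).
-export([roles/1]).

%% For a healthy chain given as an ordered list of bricks, the first brick is
%% the head, the last brick is the tail (and official tail), and any bricks in
%% between are middle bricks. A chain of length 1 has a single standalone brick.
roles([OnlyBrick]) ->
    [{OnlyBrick, standalone}];
roles([Head | Rest]) ->
    Middle = lists:droplast(Rest),
    Tail   = lists:last(Rest),
    [{Head, head}] ++ [{B, middle} || B <- Middle] ++ [{Tail, tail}].

For example, roles([b1, b2, b3]) returns [{b1,head},{b2,middle},{b3,tail}], matching the healthy chain of length three shown earlier.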

[[brick-init]] === Brick Initialization

A logical brick does not maintain an on-disk data structure, such as a binary tree or B-tree, to keep track of the keys it stores. Instead, each logical brick maintains that metadata entirely in RAM. Therefore, the only time that the metadata in the private write-ahead log is ever read is at brick initialization time, i.e. when the brick restarts.

The contents of the private write-ahead log are used to repopulate the brick’s ``key catalog’‘, the list of all keys (and associated metadata) stored by the brick.

When a logical brick is started, all of the log sequence files in the private log are read, starting from the oldest and ending with the newest. (See xref:wal-dirs-and-files[].) The total amount of data required at startup can be quite small or it can be hundreds of gigabytes. The factors that influence the amount of data in the private log are:

  • The total number of keys stored by the logical brick.
** More keys means that the log sequence file created by a checkpoint
operation will be larger.
  • The size of the brick_check_checkpoint_max_mb configuration parameter in the central.conf config file.

When the log scan is complete, construction of the brick’s in-RAM key catalog is finished.

See xref:checkpoints[] for details on brick checkpoint operations.

[[chain-repair]] === Chain Repair

When a chain is in the degraded state, new bricks that have entered their pre_init state can become eligible to join the chain. All new bricks are added to the end of the chain and undergo the chain repair process.

.Chain of length 2 in degraded state, a third brick under repair
svgimage::images/read-write-path-3-repair[align="center", scaledwidth="80%"]

The protocol used between upstream and downstream bricks is an iterative protocol that has two phases in a single iteration.

1. The upstream brick sends a subset of {Key, Timestamp} tuples downstream.
   * The downstream brick deletes keys from its key catalog that do not appear in the upstream's subset.
   * The downstream brick replies with the list of keys that it does not have or that have older timestamps.
2. The upstream brick sends full information (all key metadata and value blobs) for all keys requested by the downstream in step #1.
   * The downstream brick acknowledges the new/replacement keys.
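
The bookkeeping for one iteration of this protocol can be sketched in a few lines of Erlang. The code is illustrative only (plain key/timestamp lists instead of Hibari's key catalog, and hypothetical names), but it follows the two rules described above.

.Sketch of one repair iteration's key comparison (illustrative only)
-module(repair_sketch).
-export([diff_round/2]).

%% Upstream:   [{Key, Timestamp}] subset sent by the upstream brick.
%% Downstream: [{Key, Timestamp}] entries held by the downstream brick for the
%%             same key range.
%% Returns {KeysToDelete, KeysToRequest}.
diff_round(Upstream, Downstream) ->
    UpstreamKeys = [K || {K, _} <- Upstream],
    %% Keys the downstream holds but the upstream does not: delete them.
    Delete = [K || {K, _} <- Downstream,
                   not lists:member(K, UpstreamKeys)],
    %% Keys the downstream is missing, or holds with an older timestamp:
    %% request full metadata and value blobs from the upstream.
    Request = [K || {K, Ts} <- Upstream,
                    case lists:keyfind(K, 1, Downstream) of
                        false       -> true;
                        {_, DownTs} -> DownTs < Ts
                    end],
    {Delete, Request}.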

When the repair is finished, the Admin Server will change the roles of some/all chain members to make the repairing brick the new tail of the chain.

Only one brick may be repaired at one time. In theory it is possible to repair multiple bricks simultaneously, but the extra code complexity that would be required to do so has been judged to be too expensive (so far).

==== Chain reordering when moving from degraded -> healthy states

[[chain-reordering-middle-brick-fails]]
.Chain order after a middle brick fails and is repaired (but not yet reordered)
svgimage::images/chain-fail-repair-reorder[align="center", scaledwidth="70%"]

After a middle brick fails and is repaired, the chain’s ordering is: brick 1 -> brick 3 -> brick 2. This is the final ordering expected by the algorithm in the original chain replication paper. The Hibari implementation adds another step: reordering the chain.

For chains longer than length 1, when the Admin Server moves the chain from degraded -> healthy state, the Admin Server will reorder the chain to match the schema’s definition for the healthy chain order. The assumption is that the Hibari administrator wishes the chain to use a very specific order when it is in the healthy state. For example, if the chain’s workload were extremely read-intensive, the machine for logical brick #3 could have a faster CPU or faster disks than the other bricks in the chain. To take full advantage of the extra capacity, the chain should be reordered as soon as possible.

However, it is not easy to reorder the chain. The replication of a client update during the reordering could get lost and violate Hibari’s strong consistency guarantees. The following algorithm is used to preserve consistency:

  1. Set all bricks to read-only mode.
  2. Wait for all updates to sync to disk at each brick and to progress downstream fully from head -> tail.
  3. Set brick roles to reflect the final desired order.

  4. Set all bricks to read-write mode.
** Client ``do’’ operations that contain updates will be resubmitted (via the client-side API function brick_server:do()) to the cluster.

Typically, executing this algorithm takes less than one second. However, because the head brick is forced temporarily into read-only mode, client update requests will be delayed until read-only mode is turned off.

Client update requests submitted during read-only mode will be queued by the head brick and will be processed when read-only mode is turned off. Client read-only requests are not affected by read-only mode.

// JWN: I think it might be helpful to mention/ to explain (but maybe // not here) that Client updates may actually persist even though the // client stopped waiting and returned a timeout to the “application”. // A Timeout on Client updates can not guarantee the change was // applied or not applied to the Hibari tables.

[[checkpoints]] === Brick Checkpoint Operations

As updates are received by a brick, those updates are written to the brick’s private write-ahead log. During normal operations, the private write-ahead log is write-only: the data there is only read at logical brick initialization time.

The checkpoint operation is used to reclaim disk space in the brick’s private write-ahead log. See xref:wal-dirs-and-files[] for a description of log sequence files and xref:central-conf-parameters[] for details on the central.conf configuration file.

.Brick checkpoint processing steps

  1. When the total log size (i.e. the total size of all log files in the brick’s private log’s short-term storage area) reaches the size of the brick_check_checkpoint_max_mb parameter in central.conf, a checkpoint operation is started. Assume that the current log sequence file number is N.
  2. Two log sequence files are created, N+1 and N+2.
  3. Checkpoint data is written to log sequence file N+1.
  4. New updates by clients and chain replication are written to log sequence file N+2.
  5. Contents of the brick’s in-RAM key catalog are dumped to log sequence file N+1, subject to the bandwidth constraint of the brick_check_checkpoint_throttle_bytes configuration parameter.
  6. When the checkpoint is finished and flushed to disk, all log sequence files with a number less than or equal to N are deleted.

IMPORTANT: Each logical brick will checkpoint itself as its private log grows. It is possible that multiple logical bricks can schedule checkpoint operations simultaneously. The bandwidth limitation of the brick_check_checkpoint_throttle_bytes parameter is applied to the _sum of all writes by all checkpoint operations_.

[[scavenger]] === The Scavenger

As described in xref:write-ahead-logs[], all updates from all logical bricks are first written to the ``common log’‘. The most common of these updates are:

  • Metadata updates, e.g. key insert or key delete, by a logical brick.
  • A new value blob associated with a metadata update, such as a Hibari client set operation.
** This type is only applicable if the brick is configured to store value blobs on disk. This configuration is defined (by default) on a per-table basis and is then propagated to the chain and brick level by the Admin Server.

As explained in xref:write-ahead-logs[], the write-ahead log provides infinite storage at a logical level. But at the physical level, disk space must be reclaimed somehow. Because the common log is shared by multiple logical bricks, the technique described in xref:checkpoints[] cannot be used by the common log.

A process called the ``scavenger’’ is used to reclaim disk space in the common log. By default, the scavenger runs at 03:00 daily. The steps it executes are described below.

.Common log scavenger processing steps

  1. For all bricks that store value blobs on disk, scan each logical brick’s in-RAM key catalog to create a list of all value blob storage locations.
  2. Sort the value blob location list by log sequence number.
  3. Identify all log sequence files with a ``live data ratio’’ of at least X percent (default = 90%, see the brick_skip_live_percentage_greater_than configuration parameter); these files are skipped.
  4. For all log files where the live data ratio is less than X%, copy value blobs to new log sequence files. This copying is limited by the amount of bandwidth configured by brick_scavenger_throttle_bytes in central.conf.
  5. When all blobs have been copied out of an old log sequence file and flushed to stable storage, update the storage locations in the in-RAM key catalog, then delete the old log sequence file.
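
The skip-or-copy decision in steps 3 and 4 can be expressed as a short illustrative Erlang function. The data layout and names are hypothetical; only the threshold logic, corresponding to the brick_skip_live_percentage_greater_than parameter, comes from the description above.

.Sketch of the scavenger's per-file skip/copy decision (illustrative only)
-module(scavenger_sketch).
-export([files_to_scavenge/2]).

%% Files = [{SeqNum, LiveBytes, TotalBytes}]; ThresholdPct is e.g. 90.
%% Files whose live-data percentage is greater than ThresholdPct are skipped;
%% the remaining files are candidates for copying and deletion.
files_to_scavenge(Files, ThresholdPct) ->
    [SeqNum || {SeqNum, LiveBytes, TotalBytes} <- Files,
               TotalBytes > 0,
               (100 * LiveBytes / TotalBytes) =< ThresholdPct].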

ifdef::theme[]
image:images/scavenger-techpubs.png[]
endif::theme[]
ifndef::theme[]
image:images/scavenger-techpubs.png[width="65%"]
endif::theme[]

IMPORTANT: The value of the brick_skip_live_percentage_greater_than configuration parameter determines how much additional disk space is required to store X gigabytes of live data. If the parameter is N, then 100-N percent of all common log disk space may be wasted by storing dead data.

IMPORTANT: Additional disk space is required to log all updates that are made after the scavenger has run. This includes space in the common log as well as in each logical brick's private log (subject to the general limit of the brick_check_checkpoint_max_mb configuration parameter).

IMPORTANT: The current implementation of Hibari requires that plenty of disk space _always_ be available for write-ahead logs and for scavenger operations. We strongly recommend that the brick_scavenger_temp_dir configuration item use a different file system than the application_data_dir parameter. This directory stores temporary files required for sorting and other operations that would otherwise require large amounts of RAM.
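The file-selection rule in scavenger steps 3 and 4 can be expressed in a few lines. The sketch below is illustrative only; the {SeqNum, LiveRatio} input shape and the function name are assumptions, not the scavenger's actual internal data structures.

----
%% LogFiles is a list of {SeqNum, LiveRatio}, where LiveRatio is the
%% fraction (0.0..1.0) of a log sequence file still holding live blobs.
%% Files whose live percentage is at least SkipPct
%% (brick_skip_live_percentage_greater_than, default 90) are left alone;
%% the rest are copied and reclaimed.
files_to_scavenge(LogFiles, SkipPct) ->
    [Seq || {Seq, LiveRatio} <- LogFiles, LiveRatio * 100 < SkipPct].

%% Example: files_to_scavenge([{7, 0.95}, {8, 0.40}], 90) returns [8].
----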

== Dynamic Cluster Reconfiguration

[[add-table]] === Adding a Table

A table can be added at any time, using either of two methods:

* Use the Admin Server's HTTP service: follow the "Add a table" hyperlink at the bottom of the top-level page.
* Use the brick_admin CLI interface at the Erlang shell. See link:hibari-contributor-guide.en.html#add-a-new-table[Hibari Contributor's Guide, "Add a New Table" section].

[[remove-table]] === Removing a Table

NOTE: The current Hibari implementation does not support removing a table.

In theory, most of the work of removing a table is already done: chains that are abandoned after a migration are shut down.

* Brick pinger processes are stopped.
* Chain monitor processes are stopped.
* Bricks are stopped.
* Brick data directories are removed.

All that remains is to update the Admin Server’s schema to remove references to the table.

[[chain-length-change]] === Changing Chain Length (Changing Replication Factor)

The Hibari Admin Server manages each chain as an independent data replication entity. Though Hibari clients view multiple chains that are associated with a single table, each chain is actually independent of the other chains. It is possible to change the length of one chain without changing any others. For long term operation, such differences do not make sense. But during short periods of cluster reconfiguration, such differences are possible.

A chain’s length is determined by specifying a list of bricks that are members of that chain. The order of the list specifies the exact chain order when the chain is in the healthy state. By adding or removing bricks from a chain definition, the length of the chain can be changed.

A chain is defined by the Erlang 2-tuple of {ChainName, ListOfBricks}, where each brick in ListOfBricks is a 2-tuple {BrickName, NodeName}. For example, a chain of length two called footab_ch1 could be defined as:

----
{footab_ch1, [{footab1_ch1_b1, 'gdss1@box-a'}, {footab1_ch1_b2, 'gdss1@box-b'}]}
----

The current definition of all chains for table TableName can be retrieved from the Admin Server using the brick_admin:get_table_chain_list() function, for example:

----
%% Get a list of all tables currently defined.
> brick_admin:get_tables().
[tab1]

%% Get list of chains in 'tab1' as they are currently in operation.
> brick_admin:get_table_chain_list(tab1).
{ok,[{tab1_ch1,[{tab1_ch1_b1,'gdss1@machine-1'},
                {tab1_ch1_b2,'gdss1@machine-2'}]},
     {tab1_ch2,[{tab1_ch2_b1,'gdss1@machine-2'},
                {tab1_ch2_b2,'gdss1@machine-1'}]}]}
----

The above chain list for table tab1 corresponds to the chain and brick layout below.

.Table tab1: Two chains of length two across two Erlang nodes on two physical machines
svgimage::images/tab1-2x2[align="center", scaledwidth="70%"]

NOTE: To change the definition of a chain, use the change_chain_length/2 or change_chain_length/3 functions. For documentation, see link:hibari-contributor-guide.en.html#changing-chain-length[Hibari Contributor's Guide, "Changing Chain Length" section].

NOTE: When specifying a new chain definition, at least one brick from the current chain must be included.

// JWN: Is it dangerous to allow an admin the opportunity to NOT SPECIFY the head // of the chain in the new chain definition or to SPECIFY only a brick // that is under repair? I guess I see an opportunity for some // “dynamic” (and not just static) pre-conditions that should/could be // checked FIRST before starting to execute the changes.
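The note above describes a static precondition that can be checked before calling change_chain_length: the proposed brick list must keep at least one brick from the chain's current definition. A minimal sketch of that check follows; the function name is hypothetical and this is not part of the brick_admin API.

----
%% Both arguments are brick lists of the form [{BrickName, NodeName}].
%% Returns true when the proposed definition retains at least one brick
%% from the currently running chain.
valid_new_chain_def(CurrentBricks, NewBricks) ->
    lists:any(fun(B) -> lists:member(B, CurrentBricks) end, NewBricks).
----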

[[chain-change-same-algorithm]] ==== Chain changes: same algorithm, different tasks.

The same brick repair technique is used to handle all three of the following cases:

  • adding a brick to a chain
  • brick failure
  • removing a brick from a chain

==== Adding a brick to a chain

When a brick B is added to a chain, that brick is treated as if it was a member of the chain that had crashed long ago and has now been restarted. The same repair algorithm is used to synchronize data on brick B that is used to repair bricks that were formerly in service but since crashed and restarted. See xref:chain-repair[] for a description of the Hibari repair mechanism.

==== Brick failure

If a brick fails, the Admin Server must remove it from the chain by reordering the chain. The general order of operations is:

  1. Set new roles for the chain’s bricks, starting from the end of the chain and working backward.
  2. Broadcast the new chain membership to all Hibari clients.

If a Hibari client attempts to send an operation to a brick before the new chain membership information from step #2 arrives, that client may send the operation to the wrong brick. Hibari servers will automatically forward the query to the correct brick. Due to network latencies and asynchronous message passing, it is possible that the query will be forwarded multiple times before it arrives at the correct brick.

Specific details of how chain replication handles brick failure can be found in van Renesse and Schneider’s paper, see xref:chains[] for citation details.

===== Failure of a head brick

If the head brick fails, then the first middle brick is promoted to the head role. If there is no middle brick (i.e. the length of the chain was two), then the tail brick is promoted to a standalone role (chain length is one).

===== Failure of a tail brick

If the tail brick fails, then the last middle brick is promoted to the tail role. If there is no middle brick (i.e. the length of the chain was two), then the head brick is promoted to a standalone role (chain length is one).

[[failure-middle-brick]] ===== Failure of a middle brick

The failure of a middle brick requires the most complex recovery procedure.

* Assume that the chain is three bricks: A -> B -> C.
** If the chain is longer (more bricks upstream of A and/or more bricks downstream of C), the procedure remains the same.
* Brick C is configured to have its upstream brick be A.
* Brick A is configured to have its downstream brick be C.
* The head of the chain (brick A or the head brick upstream of A) requests a log flush of all unacknowledged writes downstream. This step is required to re-send updates that were processed by A but have not been received by C because of middle brick B's failure.
* Brick A waits until it receives a write acknowledgment from the tail of the chain. Once received, all bricks in the chain have synchronously written all items to their write-ahead logs in the correct order.

==== Removing a brick from a chain

Removing a brick B permanently from a chain is a simple operation. Brick B is handled the same way that any other brick failure is handled: the chain is simply reconfigured to exclude B. See xref:chain-reordering-middle-brick-fails[] for an example.

IMPORTANT: When a brick B is removed from a chain, all data from brick B will be deleted once the operation succeeds. At this time, the API does not have an option to allow B's data to be preserved.

// JWN: Wah ... a typo could be very dangerous. Delayed deletion of // the data and/or some other protective mechanism could be helpful.

[[chain-migration]] === Chain Migration: Rebalancing Data Across Chains

There are several cases where it is desirable to rebalance data across chains and bricks in a Hibari cluster:

  • Chains are added or removed from the cluster
  • Brick hardware is changed, e.g. adding extra disk or RAM capacity
  • A change in a table's consistent hashing algorithm configuration forces data (by definition) to move to other chains.

The same technique is used in all of these cases: chain migration. This mirrors the same design philosophy that’s used for handling chain changes (see xref:chain-change-same-algorithm[]): use the same algorithm to handle multiple use cases.

==== Example: Migrating from three chains to four

[[chain-migration-3to4]]
.Chain migration from 3 chains to 4 chains
svgimage::images/chain-migration-3to4[align="center", scaledwidth="80%"]

In the example above, both the 3-chain and 4-chain configurations used equal weighting factors. When all chains use the same weighting factor (e.g. 100), then the consistent hashing map in the ``before'' and ``after'' cases looks something like the figure below.

[[migration-3to4]]
.Migration from three chains to four chains
svgimage::images/migration-3to4[align="center", scaledwidth="70%"]

It doesn’t matter that chain #4’s total area within the unit interval is divided into three regions. What matters is that chain #4’s total area is equal to the regions of the other three chains.

==== Example: Migrating from three chains to four with unequal weighting

The diagram xref:migration-3to4[] demonstrates how a migration would work when all chains have an equal weighting factor, e.g. 100. If instead, the new chain had a weighting factor of only 50, then the distribution of keys to each chain would look like this:

.Migration from three chains to four with unequal chain weighting factors
[options="header"]
|=========
| Chain Name | Total % of keys before/after migration | Total unit interval size before/after migration
| Chain 1 | 33.3% -> 28.6% | 100/300 -> 100/350
| Chain 2 | 33.3% -> 28.6% | 100/300 -> 100/350
| Chain 3 | 33.3% -> 28.6% | 100/300 -> 100/350
| Chain 4 | 0% -> 14.3% (4.8% in each of 3 regions) | 0/300 -> 50/350 (spread across 3 regions)
| Total | 100% -> 100% | 300/300 -> 350/350
|=========

For the original three chains, the total amount of unit interval devoted to those chains is (100+100+100)/350 = 300/350. The 4th chain, because its weighting is only 50, would be assigned 50/350 of the unit interval. Then, an equal amount of unit interval is taken from the original chains and reassigned to chain #4, so (50/350)/3 of the unit interval must be taken from each original chain.
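The arithmetic above is straightforward to reproduce. The sketch below computes each chain's share of the unit interval from a weighting list shaped like the new_chainweights option; the function name is invented for illustration.

----
%% Weights is a list like [{chain_name(), integer()}].
%% Returns [{ChainName, FractionOfUnitInterval}].
unit_interval_shares(Weights) ->
    Total = lists:sum([W || {_Chain, W} <- Weights]),
    [{Chain, W / Total} || {Chain, W} <- Weights].

%% unit_interval_shares([{ch1,100}, {ch2,100}, {ch3,100}, {ch4,50}])
%% yields roughly [{ch1,0.286}, {ch2,0.286}, {ch3,0.286}, {ch4,0.143}],
%% i.e. 100/350 for each original chain and 50/350 for the new chain.
----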

==== Hotspot migration

With the lowest-level API, it is possible to assign "hot" keys to specific chains, to separate a handful of very frequently accessed keys from the large number of keys that are accessed infrequently. The table below gives an example that builds upon xref:migration-3to4[]. We assume that our "hot" key is mapped onto the unit interval at position 0.5.

.Consistent hashing lookup table with three chains of equal weight and a fourth chain with an extremely small weight
[options="header"]
|=========
| Unit interval start | Unit interval end | Chain name
| 0.000000 | 0.333333... | Chain 1
| 0.333333... | 0.5 | Chain 2
| 0.5 | 0.500000000000001 | Chain 4
| 0.500000000000001 | 0.666666... | Chain 2
| 0.666666... | 1.0 | Chain 3
|=========

The table above looks almost exactly like the “Before Migration” half of xref:migration-3to4[]. However, there’s a very tiny “hole” that is punched in chain #2’s space that maps key hashes in the range of 0.5 to 0.500000000000001 to chain #4.
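Hibari's real consistent hashing map is an opaque structure built by brick_hash:chash_init/3 and related functions; the sketch below merely illustrates how a lookup over an interval table like the one above behaves, including the tiny "hole" that routes the hot key at 0.5 to Chain 4. The function name and data shapes are invented for this illustration.

----
%% Map is a list of {Start, End, Chain} rows covering [0.0, 1.0).
%% HashPos is a key's position on the unit interval.
chain_for_position(HashPos, Map) ->
    [Chain | _] = [C || {Start, End, C} <- Map, HashPos >= Start, HashPos < End],
    Chain.

%% Map = [{0.0, 0.333333, chain1},
%%        {0.333333, 0.5, chain2},
%%        {0.5, 0.500000000000001, chain4},   %% the "hole" for the hot key
%%        {0.500000000000001, 0.666666, chain2},
%%        {0.666666, 1.0, chain3}].
%% chain_for_position(0.5, Map) returns chain4.
----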

[[adding-removing-client-nodes]] === Adding/Removing Client Nodes

It is not strictly necessary to formally configure a list of all Hibari client nodes that may use a Hibari cluster. However, practically speaking, it is useful to do so.

To bootstrap itself to be able to use Hibari servers, a Hibari client must be able to:

  1. Communicate with other Erlang nodes in the cluster.
  2. Receive “global hash” information from the cluster’s Admin Server.

To solve both problems, the Admin Server maintains a list of Hibari client nodes. (Hibari server nodes do not need this mechanism.) For each client node, a monitor process on the Admin Server polls the node to see if the gdss or gdss_client application is running. If the client node is running, then problem #1 (connecting to other nodes in the cluster) is automatically solved by using net_adm:ping/1. Problem #2 is solved by the client monitor calling brick_admin:spam_gh_to_all_nodes/0.

The Admin Server’s client monitor runs approximately once per second, so there may be a delay of up to a couple of seconds before a newly-started Hibari client node is connected to the rest of the cluster and has all of the table info required to start work.

When a client node goes down, an OTP alarm is raised until the client is up and running again.

Two methods can be used to view and change the client node monitor list:

* Use the Admin Server's HTTP service: follow the "Add/Delete a client node monitor" hyperlink at the bottom of the top-level page.
* Use the Erlang CLI to use these functions:
** brick_admin:add_client_monitor/1
** brick_admin:delete_client_monitor/1
** brick_admin:get_client_monitor_list/0
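For example, a minimal illustration from the Admin Server's Erlang shell (the client node name 'my_client@box-a' is only an example):

----
%% Start monitoring a client node, inspect the monitor list, then remove it.
> brick_admin:add_client_monitor('my_client@box-a').
> brick_admin:get_client_monitor_list().
> brick_admin:delete_client_monitor('my_client@box-a').
----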

== The Partition Detector Application

For multi-node Hibari deployments, Hibari includes a network monitoring feature that watches for partitions within the cluster, and attempts to minimize the database consequences of such partitions. This Erlang/OTP application is called the Partition Detector.

You can configure the network monitoring feature in the central.conf file. See xref:central-conf-parameters[] for details.

IMPORTANT: Use of this feature is mandatory for a multi-node Hibari deployment to prevent data corruption in the event of a network partition. If you don't care about data loss, then as an ancient Roman might say, ``Caveat emptor.'' Or in English, ``Let the buyer beware.''

For the network monitoring feature to work properly, you must first set up two separate networks, Network A and Network B, that connect to each of your Hibari physical bricks. The networks must be set up as follows:

* Network A and Network B must be physically separate networks, with different IP and broadcast addresses. See the diagram below for a two node cluster.
* Network A must be the network used for all Hibari data communications.
* Network A should have as few physical failure points as possible. For example, a single switch or load balancer is preferable to two switches cabled together.
* The separate Network B will be used to compare node heartbeat patterns.

IMPORTANT: For the network partition monitor to work properly, the partition monitor configuration settings must match as closely as possible across all Hibari nodes. Each Hibari physical brick must have unique IP addresses on its two network interfaces (as required by all IP networks), but all configurations must use the same IP subnets for the 'A' and 'B' networks, and all configurations must use the same network 'A' tiebreaker.

[[a-and-b-network-diagram]]
.Network 'A' and network 'B' diagram
svgimage::images/a-and-b-diagram[align="center", scaledwidth="80%"]

=== Partition Detector Heartbeats

Through the partition monitoring application, Hibari nodes send heartbeat messages to one another at the configurable heartbeat_beacon_interval, and each node keeps track of heartbeat history from each of the other nodes in the cluster. The heartbeats are transmitted through both Network A and Network B. If node gdss1@machine1 detects that the incoming heartbeats from gdss1@machine2 are absent both on Network A and on Network B, then gdss1@machine2 might have a problem. If the incoming heartbeats from gdss1@machine2 fail on Network A but not on Network B, a partition on Network A might be the cause. If heartbeats fail on Network B but not Network A, then Network B might have a partition problem, but this is less serious because Hibari data communication does not take place on Network B.

Configurable timers on each Hibari node determine the interval at which the absence of incoming heartbeats from another node is considered a problem. If on node gdss1@machine1 no heartbeat has been received from gdss1@machine2 for the duration of the configurable heartbeat_warning_interval, then a warning message is written to the application log of node gdss1@machine1. This warning message can be triggered by missing heartbeats either on Network A or on Network B; the warning message will indicate which node has not been heard from, and over which network.

=== Partition Detector’s Tiebreaker

If on node gdss1@machine1 no heartbeat has been received from gdss1@machine2 via Network A for the duration of the configurable heartbeat_failure_interval, and if during that period heartbeats from gdss1@machine2 continue to be received via Network B, then a network partition is presumed to have occurred in Network A. In this scenario, node gdss1@machine1 will attempt to ping the configurable network_a_tiebreaker address. If gdss1@machine1 successfully pings the tiebreaker address, then gdss1@machine1 considers itself to be on the “correct” side of the Network A partition, and it continues running. If by contrast gdss1@machine1 cannot successfully ping the tiebreaker address, then gdss1@machine1 considers itself to be on the “wrong” side of the Network A partition and shuts itself down. Meanwhile, comparable calculations and decisions are being made by node gdss1@machine2.

In a scenario where the network monitoring application determines that a partition has occurred on Network B – that is, heartbeats are received through Network A but not through Network B – then warnings are written to the Hibari nodes’ application logs but no node is shut down.
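The decision rule described in the last two sections can be summarized in a few lines of Erlang. This is only a sketch of the behavior described above, not the partition detector's actual implementation; the function and atom names are invented.

----
%% AStatus/BStatus: whether heartbeats from a peer node are still arriving
%% on Network A / Network B. CanPingTiebreaker: result of pinging the
%% network_a_tiebreaker address.
partition_decision(ok,      ok,      _)     -> keep_running;
partition_decision(missing, missing, _)     -> peer_may_be_down;
partition_decision(ok,      missing, _)     -> warn_only;       %% Network B partition
partition_decision(missing, ok,      true)  -> keep_running;    %% correct side of the A partition
partition_decision(missing, ok,      false) -> shut_down_self.  %% wrong side of the A partition
----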

== Backup and Disaster Recovery

=== Backup and Recovery Software

At the time of writing, Hibari’s largest cluster deployment is:

  • Well over 50 physical bricks
  • Well over 4TB of disk space per physical brick
  • Single data center, operated by a telecom carrier and integrated with third-party monitoring and control software

If a backup were made of all data in the cluster, the biggest question is, "Where would you store the backup?" Given the cluster's purpose (real-time email/messaging services), the quality of the data center's physical and software infrastructures, the length of the Hibari chains used for physical data redundancy, the business decision not to deploy a "hot backup" data center, and other factors, Cloudian has not developed backup and recovery software for Hibari. Cloudian's smaller Hibari deployments resemble the largest deployment in these respects.

However, we expect that backup and recovery software will be high priorities for open source Hibari users. Together with the open source users and developers, we expect this software to be developed relatively quickly.

=== Disaster Recovery via Remote Data Centers

==== Single Hibari cluster spanning two data centers

It is certainly possible to deploy a single Hibari cluster across two (or more) data centers. At the moment, however, there is only one way of doing it: each chain of data replication must have a brick located in each data center.

As a consequence of brick placement, it is mandatory that Hibari clients pay the full round-trip latency penalty for each update. See xref:diagram-write-path-3[] for a diagram; the “head” and “tail” bricks would be in separate data centers, using WAN network connectivity between them.

For some applications, strong consistency is a higher priority than low latency (both for writes and possibly for reads, if the client is not co-located in the same data center as the chain’s tail brick). In those cases, such cross-data-center brick placement can make sense.

However, Hibari’s Admin Server cannot handle all failure scenarios, especially when WAN connectivity is broken between data centers; more programming work is required for the Admin Server to automate the handling of all processes. Furthermore, Hibari’s basic design cannot tolerate network partitions well, see xref:cap-theorem-and-hibari[] and xref:admin-server-and-network-partition[]. If the Admin Server were capable of handling WAN network partitions, it’s almost certain that all Hibari nodes in one of the partitioned data centers would be inactive.

==== Multiple Hibari clusters, one per data center

Conceptually, it’s possible to run multiple Hibari clusters, one per data center. However, Hibari does not have the software required for WAN-scale replication.

In theory, such software isn’t too difficult to develop. The tail brick of each chain can maintain a log of recent updates to the chain. Those updates can be transmitted asynchronously across a WAN to another Hibari cluster in a remote data center. Such a scheme is depicted in the figure below.

[[async-replication-try1]]
.A future scenario of asynchronous, cross-data-center Hibari replication
svgimage::images/async-replication-try1[align="center", scaledwidth="80%"]

This kind of replication makes the most sense if "Data Center #1" were in an active role and "Data Center #2" were in a hot-standby role. In that case, there would never be a "Data Center #2 Client", so there would be no problem of strong consistency violations by clients accessing both Hibari clusters simultaneously. The only consistency problem would be one of durability: the replay of async update logs every N seconds would mean that up to N seconds of updates within "Data Center #1" could be lost.

However, if clients access both Hibari clusters simultaneously, then Hibari's strong consistency guarantee would be violated. Some applications can tolerate weakened consistency. Other applications, however, cannot. For those apps that must have strong consistency, Hibari will require additional design and code.

TIP: A keen-eyed reader will notice that xref:async-replication-try1[] is not fully symmetric. If clients in “Data Center #2” make updates to the chain, then the same async update log maintenance and replay to “Data Center #1” would also be necessary.

== Hibari Application Logging

NOTE: This chapter is outdated and will be rewritten by Hibari v0.6 release. Hibari now uses link:https://github.com/basho/lager#readme[Basho Lager] for logging and the default location of the log files is: <HIBARI_HOME>/logs/

The Hibari application log records application-related alerts, warnings, and informational messages, as well as trace messages for debugging. By default the application log is written to this file:

<HIBARI_HOME>/var/log/gdss-app.log

=== Format of the Hibari Application Log

Each log entry in the Hibari application log is composed of these fields in this order, with vertical bar delimitation:

<PID>|<<ERLANGPID>>|<DATETIME>|<MODULE>|<LEVEL>|<MESSAGECODE>|<MESSAGE>

This Hibari application log entry format is not configurable. Each of these application log entry fields is described in the table that follows. The ``Position'' column indicates the position of the field within a log entry.

[options="header",cols="^,^m,<"]
|=========
| Position | Field | Description
| 1 | <PID> | System-assigned process identifier (PID) of the process that generated the log message.
| 2 | <ERLANGPID> | Erlang process identifier.
| 3 | <DATETIME> | Timestamp in format %Y%m%d%H%M%S, where %Y = four digit year; %m = two digit month; %d = two digit date; %H = two digit hour; %M = two digit minute; and %S = two digit seconds. For example, 20081103230123.
| 4 | <MODULE> | The internal component with which the message is associated. This field is set to a minimum length of 13 characters. If the module name is shorter than 13 characters, spaces will be appended to the module name so that the field reaches the 13 character minimum.
| 5 | <LEVEL> | The severity level of the message. The level will be one of the following: ALERT, a condition requiring immediate correction; WARNG, a warning message, indicating a potential problem; INFO, an informational message indicating normal activity, and requiring no action; DEBUG, a highly granular, process-descriptive message potentially of use when debugging the application.
| 6 | <MESSAGECODE> | Integer code assigned to all messages of severity level INFO or higher. NOTE: This code is not yet defined in the Hibari open source release.
| 7 | <MESSAGE> | The message itself, describing the event that has occurred.
|=========
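As a small illustration of this layout, the sketch below splits one log entry into the seven fields named above; it assumes nothing beyond the pipe-delimited format described in this section.

----
%% Split one application log entry into its seven fields:
%% [Pid, ErlangPid, DateTime, Module, Level, MessageCode, Message].
%% {parts, 7} keeps any '|' characters inside the free-form <MESSAGE>
%% field from being split further.
parse_app_log_line(Line) ->
    re:split(Line, "\\|", [{return, binary}, {parts, 7}]).
----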

=== Application Log Example

Items written to the Hibari application log come from multiple sources:

  • The Hibari OTP application
  • Other OTP applications bundled with Hibari
  • Other OTP applications within the Erlang runtime system, e.g. kernel and sasl.

The <MESSAGE> field is free-form text. Application code can freely add newline characters and various white-space padding wherever it wishes. However, the file format dictates that a newline character (ASCII 10) appear only at the end of the entire app log message.

The Hibari error logger must therefore reformat the text of the <MESSAGE> field to remove newlines and to remove whitespace padding. The result is not nearly as readable as the formatting presented to the Erlang shell. For example, within the shell, a message can look like this:

----
=PROGRESS REPORT==== 12-Apr-2010::17:49:22 ===
          supervisor: {local,sasl_safe_sup}
             started: [{pid,<0.43.0>},
                       {name,alarm_handler},
                       {mfa,{alarm_handler,start_link,[]}},
                       {restart_type,permanent},
                       {shutdown,2000},
                       {child_type,worker}]
----

Within the Hibari application log, however, the same message is reformatted as line #2 below. The reformatted version is much more difficult for a human to read than the version above, but the purpose of the app log file is to be machine-parsable, not human-parsable.

----
8955|<0.54.0>|20100412174922|gmt_app      |INFO|2190301|start: normal []
8955|<0.55.0>|20100412174922|SASL         |INFO|2199999|progress: [{supervisor,{local,gmt_sup}},{started,[{pid,<0.56.0>},{name,gmt_config_svr},{mfa,{gmt_config_svr,start_link,["../priv/central.conf"]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]
8955|<0.55.0>|20100412174922|SASL         |INFO|2199999|progress: [{supervisor,{local,gmt_sup}},{started,[{pid,<0.57.0>},{name,gmt_tlog_svr},{mfa,{gmt_tlog_svr,start_link,[]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]
8955|<0.36.0>|20100412174922|SASL         |INFO|2199999|progress: [{supervisor,{local,kernel_safe_sup}},{started,[{pid,<0.59.0>},{name,timer_server},{mfa,{timer,start_link,[]}},{restart_type,permanent},{shutdown,1000},{child_type,worker}]}]
[...skipping ahead...]
8955|<0.7.0>|20100412174923|SASL         |INFO|2199999|progress: [{application,gdss},{started_at,gdss_dev2@bb3}]
8955|<0.98.0>|20100412174923|DEFAULT      |INFO|2199999|brick_sb: Admin Server not registered yet, retrying
8955|<0.65.0>|20100412174923|SASL         |INFO|2199999|progress: [{supervisor,{local,brick_admin_sup}},{started,[{pid,<0.98.0>},{name,brick_sb},{mfa,{brick_sb,start_link,[]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]
8955|<0.105.0>|20100412174924|DEFAULT      |INFO|2199999|top of init: bootstrap_copy1, [{implementation_module,brick_ets},{default_data_dir,"."}]
8955|<0.105.0>|20100412174924|DEFAULT      |INFO|2199999|do_init_second_half: bootstrap_copy1
8955|<0.79.0>|20100412174924|SASL         |INFO|2199999|progress: [{supervisor,{local,brick_brick_sup}},{started,[{pid,<0.105.0>},{name,bootstrap_copy1},{mfa,{brick_server,start_link,[bootstrap_copy1,[{default_data_dir,"."}]]}},{restart_type,temporary},{shutdown,2000},{child_type,worker}]}]
8955|<0.105.0>|20100412174924|DEFAULT      |INFO|2199999|do_init_second_half: bootstrap_copy1 finished
----

== Examining Latency in Production (Internal Event Tracing)

The Hibari source code has been annotated with over 400 tracepoints, which allow developers and system administrators to trace events through Hibari's code. These tracepoints are designed to be extremely lightweight and can be enabled in a production environment without sacrificing performance.

Trace data can be collected via DTrace/SystemTap or Erlang's tracing mechanism. For more details, please refer to the link:http://hibari.github.com/hibari-doc/hibari-contributor-guide.en.html#_hibari_internal_tracepoints["Hibari internal tracepoints"] section of the Hibari Contributor's Guide.

== Hardware and Software Considerations

As noted in xref:hibari-origins[], at the time of writing, Hibari has been deployed exclusively in data centers run by telecom carriers. All carriers have very specific requirements for integrating with their existing deployment, network monitoring, alarm management, and other infrastructures. As a result, many of those features have been omitted to date from Hibari. With Hibari's release into an "open source environment", we expect that these gaps will be closed.

Hibari’s carrier-centric heritage has also influenced the types of hardware, networking gear, operating system, support software, and internal Hibari configuration that have been used successfully to date. Some of these practices will change as Hibari evolves from its original use patterns. Until then, this section discusses some of the things that a systems/network administrator must consider when deploying a Hibari cluster.

Similarly, application developers must be very familiar with these same issues. An unaware developer can create an application that uses too many resources on under-specified hardware, causing problems for developers, support staff, and application users alike. We wish Hibari to grow and flourish in its non-relational DB niche.

[[brick-hardware]] === Notes on Brick Hardware

==== Lots of RAM is better

Each Hibari logical brick stores all information about its keys in RAM. Both the logical brick’s private write-ahead log and the common write-ahead log are not “disk-based data structures” in the typical sense, such as on-disk hash tables or B-trees. Therefore, Hibari bricks require a lot of RAM to function.

For more details, see:

  • xref:overview-high-performance[]
  • xref:per-table-config-perf-options[] ... if a table stores its value blobs in RAM, it will consume more RAM than if those value blobs are stored on disk.
  • xref:hibari-data-model[]
  • xref:brick-init[]

==== Lots of disk I/O capacity is better

By default, Hibari will write and flush each update to disk before sending a reply downstream or back to the client. Hibari will perform better on systems that have higher disk I/O capacity.

  • Non-volatile/battery-backed cache on the disk controller(s) is helpful, when combined with a write-back cache policy. The more cache, the better. If the read/write ratio of the cache can be changed, a default policy of 10/90 or 0/100 (i.e. skewed to writes) is typically more helpful than a default 50/50 split.
  • On-disk (volatile) cache on individual disks is not helpful.
  • Faster spinning disks are more helpful than slower spinning disks.
  • If using RAID, a large stripe width of e.g. 512KBytes or 1024KBytes is usually more helpful than the (usually) smaller default stripe width on most controllers.
  • If using RAID, a hardware RAID implementation may be very slightly helpful.
  • RAID redundancy (e.g. RAID 1, 10, 5, 6) is not required by Hibari, but it can help reduce the odds of failure of an individual physical brick. If physical bricks do not use data redundant RAID (e.g. RAID 0, concatenation), it’s a good idea to consider using longer replication chains to compensate.

For more details, see:

  • xref:the-physical-brick[]
  • xref:per-table-config-perf-options[]
  • xref:hibari-data-model[]

[[high-io-rate-devices]] ==== High I/O rate devices (e.g. SSD) may be used

Hibari has some support for high I/O rate devices such as solid state disks, flash memory disks, flash memory storage cards, et al. There is nothing in Hibari’s implementation that would preclude using high-speed disk devices as the only storage for Hibari write-ahead logs.

Hibari has a feature that can segregate high write I/O with fsync(2) operations onto a separate high-speed device, and use cheaper & lower-speed Winchester disk devices for bulk storage. This feature has not yet been well-tested and optimized.

For more details, see:

  • xref:write-ahead-logs[]
  • xref:two-wal-types[]

==== Lots of disk storage capacity may be a secondary concern

More disks of smaller capacity are almost always more helpful than a few disks of larger capacity. RAID 0 (no data redundancy) or RAID 10 (“mirror” data redundancy) is useful for combining the I/O capacity of multiple disks into a single logical volume. Other RAID levels, such as 5 or 6, can be used, though at the expense of higher write I/O overhead.

For more details, see:

  • xref:write-ahead-logs[]

[[considerations-cpu]] ==== Lots of CPU capacity is a secondary concern

Hibari storage bricks do not, as a general rule, require large amounts of CPU capacity. The largest single source of CPU consumption is in MD5 checksum calculation. If the data objects most commonly written & read by your application are small, then multi-socket, multi-core CPUs are not required.

Each Hibari logical brick is implemented within the Erlang virtual machine as a single gen_server process. Therefore, each logical brick can (generally speaking) only fully utilize one CPU core. If your Hibari cluster appears to have CPU-utilization imbalance, then the recommended strategy is to change the chain placement policy of the chains. For example, there are two methods for arranging a chain of length three across three physical bricks:

[[1-chain-striped-across-3-bricks]] The first example shows one chain striped across three physical bricks. If the read/write ratio for the chain is extremely high (i.e. most operations are reads), then most of the CPU activity (and perhaps disk I/O, if blobs are stored on disk) will be directed to the “Chain 1 tail” brick and cause a CPU utilization imbalance.

.One chain striped across three physical bricks
[options="header"]
|=========
| Physical Brick X | Physical Brick Y | Physical Brick Z
| Chain 1 head -> | Chain 1 middle -> | Chain 1 tail
|=========

[[3-chains-striped-across-3-bricks]] The second example shows the same three physical bricks but with three chains striped across them. In this example, each physical brick is responsible for three different roles: head, middle, and tail. Regardless of the read/write operation ratio, all bricks will utilize roughly the same amount of CPU.

.Three chains striped across three physical bricks
[options="header"]
|=========
| Physical Brick T | Physical Brick U | Physical Brick V
| Chain 1 head -> | Chain 1 middle -> | Chain 1 tail
| Chain 2 tail | Chain 2 head -> | Chain 2 middle ->
| Chain 3 middle -> | Chain 3 tail | Chain 3 head ->
|=========

In multi-CPU and multi-core systems, a side-effect of using more chains (and therefore more bricks) is that the Erlang virtual machine can schedule more logical brick computation across a larger number of cores and CPUs.

=== Notes on Networking

Hibari works quite well using commodity “Gigabit Ethernet” interfaces. Lower latency (and higher cost) networking gear, such as Infiniband, is not required.

For production use, it is _strongly recommended_ that all Hibari servers be configured with two physical network interfaces, cabling, switches, etc. For more details, see:

  • xref:partition-detector[]

==== Client protocol load balancing

The native Erlang client, via the gdss or gdss_client OTP applications, does not require any load balancing. The Erlang client is already a participant in the consistent hashing algorithm (see xref:consistent-hashing-example[]). The Admin Server distributes updates to a table's consistent hash map each time cluster membership or chain/brick status changes.

All other client access protocols are “dumb”, by comparison. Take for example the Amazon S3 protocol service. There is no easy way for a Hibari cluster to convey to a generic HTTP client how to calculate which brick to send a query to. The HTTP redirect mechanism could be used for this purpose, but other protocols don’t have an equivalent feature. Also, the latency overhead of sending a redirect is far higher than Hibari’s solution to this problem.

Hibari's solution is simple: the Hibari server-side "dumb" protocol handler uses the same native Erlang client that any other Hibari client app written in Erlang uses. That client is capable of making direct routing decisions. Therefore, the "dumb" protocol handler within a Hibari node acts as a translating proxy: it uses the "dumb" client access protocol on one side and the native Erlang client API on the other.

.Hibari "dumb" protocol proxy
svgimage::images/dumb-protocol-proxy[align="center", scaledwidth="80%"]

The deployed “state of the art” for such dumb protocols is to use a TCP load balancer (aka a “layer 4” load balancer) to spread dumb client workload across multiple Hibari dumb protocol servers.

=== Notes on Operating System

Hibari servers operate on top of the Erlang virtual machine. In principle, any operating system that is supported by the Erlang virtual machine can support Hibari.

==== Supported Operating Systems

In practice, Hibari is supported on the following operating systems:

* Linux x86_64
** Red Hat Enterprise Linux 5.x and 6.x (RHEL 5.3 is used in production and QA environments within Cloudian, Inc.)
** CentOS 5.x and 6.x
** Ubuntu 12.04 LTS or newer
* Linux ARMv7 (32 bit)
** Ubuntu 12.04 LTS or newer
** Hibari runs on Calxeda EnergyCore based super high-density, scale-out clusters
* Unix Solaris variants
** Joyent SmartOS (64 bit)
* Mac OS X
* FreeBSD (though not currently in a jail environment, due to some TCP services getting EPROTONOSUPPORT errors)

The versions recently tested for Hibari by the community:

* CentOS 6.3 (x86_64)
* Ubuntu 12.04 LTS (ARMv7)
* Joyent SmartOS 20130221 (64 bit)

To take advantage of RAM larger than 4GB, we recommend that you use a 64-bit version of your OS’s kernel, 64-bit versions of the user runtime, and a 64-bit version of the Erlang/OTP runtime.

[[os-readahead-configuration]] ==== OS Readahead Configuration

Some operating systems have support for OS-based “readahead”: pre-fetching blocks of a file with the expectation that those blocks will soon be requested by the application. Properly configured, readahead can substantially raise throughput and reduce latency on many read-heavy I/O workloads.

The read I/O workloads for Hibari fall into two major categories:

  1. Extremely predictable sequential read-only I/O during brick initialization (see xref:brick-init[]).
  2. Extremely unpredictable random read I/O for fetching value blobs from disk.

The first I/O pattern can usually benefit a great deal from an aggressive readahead policy. However, an aggressive readahead policy can have the opposite effect on the second I/O pattern. Readahead policies under Linux, for example, are defined on a per-block-device basis and do not change in response to application runtime behavior.

If your OS supports readahead policy configuration, we recommend starting with a small readahead setting and then measuring its effect with a real or simulated workload against a real Hibari server.

[[disk-scheduler-configuration]] ==== Disk Scheduler Configuration

We recommend that you experiment with disk scheduler configuration on relevant OSes such as Linux. The “deadline” scheduler is likely to provide better performance characteristics.

=== Notes on Supporting Software

A typical “server” type installation of a Linux or FreeBSD OS is sufficient for Hibari. The following is an incomplete list of other software packages that are necessary for Hibari’s installation and/or runtime.

  • NTP
  • Erlang/OTP version R13B04
  • Either “lynx” or “elinks”, a text-based Web browser

// JWN: This seems like a good place to mention patches that are // needed beyond R13B04 ... busy dist port?

[[ntp-config-strongly-recommended]] ==== NTP configuration of all Hibari server and client nodes

It is strongly recommended that all Hibari server and client nodes have the NTP daemon (Network Time Protocol) installed, properly configured, and running.

* The brick_simple client API uses the OS clock for automatic generation of timestamps for each key update. The application problems caused by badly out-of-sync OS clocks can be easily avoided by NTP.
* If a client's clock is skewed by more than the brick_do_op_too_old_timeout configuration attribute in central.conf (units = milliseconds), then the brick will silently discard the client's operation. The only symptoms of this are:
** Client-side timeouts when using the brick_simple, brick_server, or brick_squorum APIs.
** An increasing n_too_old statistic counter on the brick.

=== Notes on Hibari Configuration

There are several reasons why disk I/O rates can temporarily increase within a Hibari physical brick:

* Logical brick checkpoints: increased write I/O ops, see xref:checkpoints[]
* The common log "scavenger": increased read and write I/O ops, see xref:scavenger[]
* Chain replication repair, see xref:chain-repair[]
** As the upstream/"repairer" brick: extra read I/O ops, if the brick stores value blobs on disk
** As the downstream/"repairee" brick: extra write I/O ops
The Hibari central.conf file contains parameters that can limit the amount of disk bandwidth used by most of these operations.

See also:

  • xref:considerations-cpu[]
  • xref:central-conf-parameters[]

=== Notes on Monitoring a Hibari Cluster

The Admin Server's status page contains current status information regarding all tables, chains, and bricks in the cluster. By default, this service listens on TCP port 23080 and is reachable via HTTP at http://any-hibari-node-name:23080/. An HTTP redirect will steer your browser to the Admin Server node.

  • Hypertext links for each table, chain, and brick can show more detailed info on each entity.
  • The “Dump History” link at the bottom of the Admin Server’s HTTP status page can show operations history across multiple bricks, chains, and/or tables by using the regular expression feature.
* Each logical brick maintains counters of each type of Hibari client op primitive. At present, these stats are only exposed via the HTTP status server or by the native Erlang interface, but it's possible to expose these stats via SNMP and other protocols in a straightforward manner.
** Stats include: number of add, replace, set, get, get_many, delete, and micro-transaction operations.

==== Hibari Admin Server HTTP status

For example screen shots of the Admin Server status pages (a work in progress), see link:./misc-screenshots/admin-server-status/index.html[].

See also:

  • xref:chain-lifecycle-fsm[]
  • xref:brick-lifecycle-fsm[]

== Administering Hibari Through the API

* Add a new table
* Delete a table
* Change to a single chain:
** Add one or more bricks (increase replication factor)
** Remove one or more bricks (decrease replication factor)
* Change to a single table:
** Add a new chain
** Remove a chain
** Change the chain weighting factor
** Change consistent hashing parameters

[[add-a-new-table]] === Add a New Table: brick_admin:add_table()

[[why-use-hash-prefixes]] ==== Why use hash prefixes?

Hash prefixes allow Hibari servers to guarantee the application developer that certain keys will always be stored on the same chain and therefore always on the same set of bricks. With this guarantee, an application aware of hash prefixes can use micro-transactions successfully.

For example, assume the application requires a collection of persistent stacks that are stored in Hibari.

  • Each stack is identified by a string/binary. (The two types are identical for the sake of discussion.)
  • Each item stored on the stack is a string.
  • Support stack operations push & pop.
  • Support quick stack stats, e.g. # of elements on the stack and # of bytes stored on the stack.
  • Stacks may contain hundreds of thousands of items.
  • The total size of a stack will not exceed the total storage capacity of any single brick in the cluster.

IMPORTANT: Understanding the last assumption is vital. Because all keys with the same hash prefix H will be managed by the same chain C, then all bricks in C must have enough capacity to store all H prefix keys.

The application developer then makes the following decisions:

  1. The application will use a table devoted to storing stacks, called ‘stack’.
  2. We know that the application requires strong durability (which is the Hibari default) and that the sum total of all stack items will exceed a single brick’s RAM capacity. Therefore, the ‘stack’ table must store its value blobs on disk. Read access to the table will be slower than if value blobs were stored in RAM, but the limited RAM capacity of bricks does not give us a choice.
  3. We have two machines, boxA and boxB, available for hosting the table’s logical bricks. We want to be able to survive at least one physical brick failure, therefore all chains have a minimum length of 2.
** We will use two chains, so that each physical machine (when up and running smoothly) will have 2 logical bricks for the table, one in the chain head role and one in the chain tail role.
** The naming scheme used for each chain name and brick name can be arbitrary, as long as all names are unique. However, for ease-of-management purposes, the use of a systematic naming scheme is strongly encouraged. The scheme used here numbers each chain (starting at 1) and numbers each brick (also starting at 1) with both the chain and brick number.

4. We use the following key naming convention:
** A stack's metadata (item count, byte count) uses <<"/StackName/md">>.
** An item uses <<"/StackName/N">> where N is the item number.
5. We create the table using the following:
+
----
Opts = [{hash_init, fun brick_admin:chash_init/3},
        {prefix_method, var_prefix},
        {num_separators, 2},
        {prefix_separator, $/},
        {new_chainweights, [{stack_ch1, 100}, {stack_ch2, 100}]},
        {bigdata_dir, "."},
        {do_logging, true},
        {do_sync, true}].

ChainList = [{stack_ch1, [{stack_ch1_b1, hibari1@boxA},
                          {stack_ch1_b2, hibari1@boxB}]},
             {stack_ch2, [{stack_ch2_b1, hibari1@boxB},
                          {stack_ch2_b2, hibari1@boxA}]}].

brick_admin:add_table(stack, ChainList, Opts).
----

See xref:examples-using-the-stack[] for sample usage code.

[[types-of-brick-admin-add-table]] ==== Types for brick_admin:add_table()

----
add_table(Name, ChainList, BrickOptions)
  when is_atom(Name), is_list(ChainList)
    equivalent to add_table(brick_admin, Name, ChainList, BrickOptions)

add_table(ServerRef, Name, BrickOptions)
  when is_atom(Name), is_list(BrickOptions)
    equivalent to add_table(ServerRef, Name, ChainList, [])

add_table(ServerRef::gen_server_serverref(), Name::table(),
          ChainList::chain_list(), BrickOptions::brick_options())
  -> ok | {error, term()} | {error, term(), term()}

gen_server_serverref() = "ServerRef" type from STDLIB gen_server, gen_fsm, etc.
proplists_property()   = "Property" type from STDLIB proplists

bigdata_option()  = {'bigdata_dir', string()}
brick()           = {logical_brick(), node()}
brick_option()    = chash_prop() |
                    custom_prop() |
                    fixed_prefix_prop() |
                    {'hash_init', fun/3} |
                    var_prefix_prop()
brick_options()   = [brick_option()]
chain_list()      = [{chain_name(), [brick()]}]
chain_name()      = atom()
chash_prop()      = {'new_chainweights', chain_weights()} |
                    {'num_separators', integer()} |
                    {'old_float_map', float_map()} |
                    {'prefix_is_integer_hack', boolean()} |
                    {'prefix_length', integer()} |
                    {'prefix_method', 'all' | 'var_prefix' | 'fixed_prefix'} |
                    {'prefix_separator', integer()}
chain_weights()   = [{chain_name(), integer()}]
custom_prop()     = proplists_property()
fixed_prefix_prop() = {'prefix_is_integer_hack', boolean()} |
                      {'prefix_length', integer()}
logging_option()  = {'do_logging', boolean()}
logical_brick()   = atom()
node()            = atom()
sync_option()     = {'do_sync', boolean()}
table()           = atom()
var_prefix_prop() = {'num_separators', integer()} |
                    {'prefix_separator', integer()}
----

{'bigdata_dir', string()}::
To store value blobs on disk (i.e. "big data" is true), specify this value with any string (the string's actual value is not used).
+
IMPORTANT: To store value blobs in RAM, this option must be omitted.

{'do_logging', boolean()}::
Specify whether all bricks in the table will log updates to disk. If not specified, the default is true.

{'do_sync', boolean()}::
Specify whether all bricks in the table will synchronously flush all updates to disk before responding to the client. If not specified, the default is true.

{'hash_init', fun/3}::
Specify the hash initialization function. Of the four hash methods bundled with Hibari, we recommend using brick_hash:chash_init/3 only.

{'new_chainweights', chain_weights()}::
(For brick_admin:chash_init/3) Specify the chain weights for this new table. For creating a new table, this option is not used. However, this option is used when changing a table to add/remove chains or to change other table-related parameters.

{'num_separators', integer()}::
(For brick_admin:chash_init/3 and brick_admin:var_prefix_init/3) For variable prefix hashes, this option specifies how many instances of the variable prefix separator character (see 'prefix_separator' below) are included in the hashing prefix. The default is 2.
+
For example, if {'prefix_separator', $/}, then
+
** With {'num_separators', 2} and key <<"/foo/bar/baz/hello">>, the hashing prefix is <<"/foo/">>.
** With {'num_separators', 3} and key <<"/foo/bar/baz/hello">>, the hashing prefix is <<"/foo/bar/">>.

{'old_float_map', float_map()}::
Specify the old version of the "float map". For creating a new table, this option is not used. However, this option is used when changing a table to add/remove chains or to change other table-related parameters: it is used to create a new mapping of {table, key} -> chain that relocates only a minimum number of keys to a new chain.

{'prefix_method', 'all' | 'var_prefix' | 'fixed_prefix'}::
(For brick_admin:chash_init/3) Specify which prefix method will be used for consistent hashing:
+
** 'all': Use the entire key
** 'var_prefix': Use a variable-length prefix of the key
** 'fixed_prefix': Use a fixed-length prefix of the key

{'prefix_is_integer_hack', boolean()}::
(For brick_admin:fixed_prefix_init/3) If true, the prefix should be interpreted as an ASCII representation of a base 10 integer for use as the hash calculation.

{'prefix_length', integer()}::
(For brick_admin:fixed_prefix_init/3) For fixed-prefix hashes, this option specifies the prefix length.

{'prefix_separator', integer()}::
(For brick_admin:chash_init/3 and brick_admin:var_prefix_init/3) For variable prefix hashes, this option specifies the single-byte ASCII value of the byte that separates the key's prefix from the rest of the key. The default is $/, ASCII 47.
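To make num_separators and prefix_separator concrete, here is a small sketch that extracts a variable-length hashing prefix from a key. It reproduces the <<"/foo/">> example above, but it is not Hibari's internal implementation and the function name is invented.

----
%% Return the hashing prefix of Key: everything up to and including the
%% NumSeparators'th occurrence of the separator byte Sep (e.g. $/).
hashing_prefix(Key, Sep, NumSeparators) when is_binary(Key) ->
    hashing_prefix(Key, Sep, NumSeparators, <<>>).

hashing_prefix(_Key, _Sep, 0, Acc) ->
    Acc;
hashing_prefix(<<Sep, Rest/binary>>, Sep, N, Acc) ->
    hashing_prefix(Rest, Sep, N - 1, <<Acc/binary, Sep>>);
hashing_prefix(<<C, Rest/binary>>, Sep, N, Acc) ->
    hashing_prefix(Rest, Sep, N, <<Acc/binary, C>>);
hashing_prefix(<<>>, _Sep, _N, Acc) ->
    Acc.

%% hashing_prefix(<<"/foo/bar/baz/hello">>, $/, 2) returns <<"/foo/">>.
%% hashing_prefix(<<"/foo/bar/baz/hello">>, $/, 3) returns <<"/foo/bar/">>.
----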

[[examples-using-the-stack]] ==== Example code for using the stack

.Create a new stack
----
Val = #stack_md{count = 0, bytes = 0}.
brick_simple:add(stack, "/new-stack/md", term_to_binary(Val)).
----

.Push an item onto a stack
----
{ok, OldTS, OldVal} = brick_simple:get(stack, "/new-stack/md").
#stack_md{count = Count, bytes = Bytes} = binary_to_term(OldVal).
NewMD = #stack_md{count = Count + 1, bytes = Bytes + size(NewItem)}.
ItemKey = "/new-stack/" ++ integer_to_list(Count).
[ok, ok] = brick_simple:do(stack,
                           [brick_server:make_txn(),
                            brick_server:make_replace("/new-stack/md",
                                                      term_to_binary(NewMD),
                                                      0, [{testset, OldTS}]),
                            brick_server:make_add(ItemKey, NewItem)]).
----

.Pop an item off a stack
----
{ok, OldTS, OldVal} = brick_simple:get(stack, "/new-stack/md").
#stack_md{count = Count, bytes = Bytes} = binary_to_term(OldVal).
ItemKey = "/new-stack/" ++ integer_to_list(Count - 1).
{ok, _, Item} = brick_simple:get(stack, ItemKey).
NewMD = #stack_md{count = Count - 1, bytes = Bytes - size(Item)}.
[ok, ok] = brick_simple:do(stack,
                           [brick_server:make_txn(),
                            brick_server:make_replace("/new-stack/md",
                                                      term_to_binary(NewMD),
                                                      0, [{testset, OldTS}]),
                            brick_server:make_delete(ItemKey)]).
Item.
----

[[delete-a-table]] === Delete a Table

As yet, Hibari does not have a method to delete a table. The only methods available now are:

  • Delete all files and subdirectories from the bootstrap_* brick data directories, restart the Admin Server, and recreate all tables. (Also known as, “Start over”.)
  • Make a backup copy of all bootstrap_* brick data directories before creating a new table. If you wish to undo, then stop Hibari on all Admin Server-eligible nodes, remove the bootstrap_* brick data directories, restore the bootstrap_* brick data directories from the previous backup, then start all of the Admin Server-eligible nodes.

[[change-a-chain-add-remove-bricks]] === Change a Chain: Add or Remove Bricks

Adding or removing bricks from a single chain changes the replication factor for the keys stored in that chain: more bricks increases the replication factor, and fewer bricks decreases it.

.Data types for brick_admin:change_chain_length()
----
brick_admin:change_chain_length(ChainName, BrickList)

ChainName = atom()
BrickList = [brick()]

brick() = {logical_brick(), node()}
logical_brick() = atom()
node() = atom()
----

See also, xref:example-change-chain-length[brick_admin:change_chain_length() usage examples].

[[change-a-table-add-remove-chains]] === Change a Table: Add/Remove Chains

.Data types for brick_admin:start_migration()
----
brick_admin:start_migration(TableName, LH)
  equivalent to brick_admin:start_migration(TableName, LH, [])

brick_admin:start_migration(TableName, LH, Options)
  -> {ok, cookie()} | {'EXIT', term()}

TableName = atom()
LH = hash_r()
Options = migration_options()

cookie() = term()
migration_option() = {'do_not_initiate_serial_ack', boolean()} |
                     {'interval', integer()} |
                     {'max_keys_per_chain', integer()} |
                     {'max_keys_per_iter', integer()} |
                     {'propagation_delay', integer()}
migration_options() = [migration_option()]

brick_admin:chash_init('via_proplist', ChainList, Options) -> hash_r()

ChainList = chain_list()
Options = brick_options()
----

See xref:types-of-brick-admin-add-table[] for definitions of chain_list() and brick_options() types.

The hash_r() type is an Erlang record, #hash_r as defined in the brick_hash.hrl header file. It is normally considered an opaque type that is created by a function such as brick_hash:chash_init/3.

NOTE: The options list passed in argument #3 to brick_admin:chash_init/3 is the same properties list that is used for brick_admin:add_table/3. The difference is that the options that are related strictly to brick behavior, such as the do_logging and do_sync properties, are ignored by chash_init/3.

Once a hash_r() term is created and brick_admin:start_migration/2 is called successfully, the data migration will start immediately.

The cookie() type is an opaque term that uniquely identifies the data migration that was triggered for the TableName table. Another data migration may not be triggered until the current migration has finished successfully.

The migration_option() properties are described below:

{'do_not_initiate_serial_ack', boolean()}::
For internal use only, do not use.

{'interval', integer()}::
Interval (in milliseconds) to send kick_next_sweep messages. Default = 50.

{'max_keys_per_chain', integer()}::
Maximum number of keys to send to any particular chain. Not yet implemented.

{'max_keys_per_iter', integer()}::
Maximum number of keys to examine per sweep iteration. Default = 500 for bricks with value blobs in RAM, 25 for bricks with value blobs on disk.

{'propagation_delay', integer()}::
Number of milliseconds to delay for each brick's logging operation. Default = 0.
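Putting these pieces together, a migration call might look like the sketch below. The chain list, weights, and option values are examples only and must be adapted to your own cluster; see xref:changing-chains-example[] for a fuller walk-through.

----
%% Example only: build a hash_r() for a new chain layout, then start the
%% migration with a conservative per-iteration key limit.
NewCL = [{tab1_ch1, [{tab1_ch1_b1, hibari1@bb3}]},
         {tab1_ch2, [{tab1_ch2_b1, hibari1@bb3}]}].
Opts = [{new_chainweights, [{tab1_ch1, 100}, {tab1_ch2, 100}]}].
LH = brick_admin:chash_init('via_proplist', NewCL, Opts).
{ok, Cookie} = brick_admin:start_migration(tab1, LH, [{max_keys_per_iter, 25}]).
----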

See also xref:changing-chains-example[].

[[change-a-table-chain-chain-weighting]] === Change a Table: Change Chain Weighting

The functions to change chain weighting are the same as for adding/removing chains; see xref:change-a-table-add-remove-chains[] for additional details.

When creating a hash_r() type record, follow these two bits of advice:

  • The chain_list() term remains exactly the same as the chain list currently used by the table. See brick_admin:get_table_chain_list/1 for how to retrieve this list.
  • The new_chainweights property in the brick_options() list specifies a different set of chain weighting factors than is currently used by the table. The current chain weighting list is in the brick_options property returned by the brick_admin:get_table_info/1 function.

See also xref:changing-chains-example[].

[[admin-server-api]] === Admin Server API

See EDoc documentation for brick_admin.erl API.

[[scoreboard-api]]
=== Scoreboard API

See EDoc documentation for brick_sb.erl API.

[[chain-monitor-api]]
=== Chain Monitor API

See EDoc documentation for brick_chainmon.erl API.

[[changing-chain-length]]
=== Changing Chain Length: Examples

The Admin Server’s basic definition of a chain consists of the chain’s name and the list of bricks in the chain. In turn, each brick is defined by a 2-tuple of brick name and node name.

.Example chain definition, chain length=1
--------------------------------------
{tab1_ch1, [{tab1_ch1_b1, hibari1@bb3}]}
--------------------------------------

The function brick_admin:get_table_chain_list/1 will retrieve the active chain definition list for a table. For example, we retrieve the chain definition list for the table tab1. The node bb3 is the hostname of my laptop.

--------------------------------------
(hibari1@bb3)23> {ok, Tab1ChList} = brick_admin:get_table_chain_list(tab1).
{ok,[{tab1_ch1,[{tab1_ch1_b1,hibari1@bb3}]}]}

(hibari1@bb3)24> Tab1ChList.
[{tab1_ch1,[{tab1_ch1_b1,hibari1@bb3}]}]
--------------------------------------

NOTE: The brick_admin:get_table_chain_list/1 function will retrieve the active chain definition list for a table: only bricks that are in ok state will be shown. If a chain has a brick that has crashed, that brick will not appear in the list returned by this function. The brick_admin:get_table_info() function can fetch the list of all bricks, in service and crashed, but the API is not as convenient.

[[example-change-chain-length]]
To change the chain length, use the brick_admin:change_chain_length/2 function. The arguments are the chain name and brick list.

NOTE: Any bricks in the brick list that aren’t in the chain are automatically started. Any bricks in the current chain that are not in the new list are halted, and their persistent data will be deleted.

// JWN: The deletion is not immediate on disk - correct? Scavenger is
// needed - right?

--------------------------------------
(hibari1@bb3)29> brick_admin:change_chain_length(tab1_ch1,
                     [{tab1_ch1_b1, hibari1@bb3}, {tab1_ch1_b2, hibari1@bb3}]).
ok

(hibari1@bb3)30> {ok, Tab1ChList2} = brick_admin:get_table_chain_list(tab1).
{ok,[{tab1_ch1,[{tab1_ch1_b1,hibari1@bb3},
                {tab1_ch1_b2,hibari1@bb3}]}]}
--------------------------------------

Now the tab1_ch1 chain has length two. We’ll shorten it back down to length one.

--------------------------------------
(hibari1@bb3)31> brick_admin:change_chain_length(tab1_ch1,
                     [{tab1_ch1_b2, hibari1@bb3}]).
ok

(hibari1@bb3)32> {ok, Tab1ChList3} = brick_admin:get_table_chain_list(tab1).
{ok,[{tab1_ch1,[{tab1_ch1_b2,hibari1@bb3}]}]}
--------------------------------------

NOTE: A chain’s new brick list must contain at least one brick from the current chain’s definition. If the intersection of old brick list and new brick list is empty, the command will fail.
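One practical consequence of this rule is that a chain can be moved from one node to another in two steps, keeping at least one brick in common at each step. A hypothetical sketch follows; the second node name and the brick names are illustrative only.

--------------------------------------
%% Hypothetical two-step move of chain tab1_ch1 from node hibari1@bb3 to a
%% second node hibari2@bb3. Each call shares at least one brick with the
%% previous chain definition, so each call is legal on its own.
ok = brick_admin:change_chain_length(tab1_ch1,
         [{tab1_ch1_b1, hibari1@bb3}, {tab1_ch1_b2, hibari2@bb3}]),
ok = brick_admin:change_chain_length(tab1_ch1,
         [{tab1_ch1_b2, hibari2@bb3}]).
--------------------------------------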


[[changing-chains-example]]
=== Creating and Rebalancing Chains: Examples

Creating new chains, deleting existing chains, reweighting existing chains, and rehashing are all done using the brick_admin:start_migration() function. The chain definitions are specified in the same way as when changing chain lengths; see xref:changing-chain-length[] for details.

The data structure required by brick_admin:start_migration/2 is more complex than the relatively simple brick list that brick_admin:change_chain_length/2 requires. This section will demonstrate the creation of this structure, the ``local hash record'', step-by-step.

First, we create a new chain definition list. (Refer to xref:changing-chain-length[] if necessary.) For this example, we’ll assume that we’ll be modifying the tab1 table and that we’ll be adding two more chains. Each chain will be of length one. We’ll place each chain on the same node as everything else, hibari1@bb3 (i.e. my laptop).

--------------------------------------
(hibari1@bb3)49> NewCL = [{tab1_ch1, [{tab1_ch1_b1, hibari1@bb3}]},
                          {tab1_ch2, [{tab1_ch2_b1, hibari1@bb3}]},
                          {tab1_ch3, [{tab1_ch3_b1, hibari1@bb3}]}].
[{tab1_ch1,[{tab1_ch1_b1,hibari1@bb3}]},
 {tab1_ch2,[{tab1_ch2_b1,hibari1@bb3}]},
 {tab1_ch3,[{tab1_ch3_b1,hibari1@bb3}]}]
--------------------------------------

NOTE: Any bricks in the chain definition list that aren’t already in a chain are automatically started. Any bricks in the current chains that are not in the new chain definition are halted, and their persistent data will be deleted.

Next, we retrieve the table’s current hashing configuration. The data is returned to us in the form of an Erlang property list. (See the Erlang/OTP documentation for the proplists module, located in the “Basic Applications” area under “stdlib”.) We then pick out several properties that we’ll need later; we use lists:keyfind/3 instead of a function in the proplists module because it will preserve the properties in 2-tuple form, which will save us some typing effort later.

--------------------------------------
...lots of stuff omitted...

(hibari1@bb3)53> Opts = proplists:get_value(brick_options, TabInfo).
[{hash_init,#Fun<brick_hash.chash_init.3>},
 {old_float_map,[]},
 {new_chainweights,[{tab1_ch1,100}]},
 {hash_init,#Fun<brick_hash.chash_init.3>},
 {prefix_method,var_prefix},
 {prefix_separator,47},
 {num_separators,3},
 {bigdata_dir,"cwd"},
 {do_logging,true},
 {do_sync,true},
 {created_date,{2010,4,17}},
 {created_time,{17,21,58}}]

(hibari1@bb3)58> PrefixMethod = lists:keyfind(prefix_method, 1, Opts).
{prefix_method,var_prefix}

(hibari1@bb3)59> NumSep = lists:keyfind(num_separators, 1, Opts).
{num_separators,3}

(hibari1@bb3)60> PrefixSep = lists:keyfind(prefix_separator, 1, Opts).
{prefix_separator,47}

(hibari1@bb3)61> OldCWs = proplists:get_value(new_chainweights, Opts).
[{tab1_ch1,100}]

(hibari1@bb3)62> OldGH = proplists:get_value(ghash, TabInfo).

(hibari1@bb3)63> OldFloatMap = brick_hash:chash_extract_new_float_map(OldGH).
--------------------------------------

Next, we create a new property list. Its new_chainweights entry uses a new weighting list, NewCWs, that assigns a weight of 100 to each of the three chains.

--------------------------------------
(hibari1@bb3)72> NewOpts = [PrefixMethod, NumSep, PrefixSep,
                            {new_chainweights, NewCWs},
                            {old_float_map, OldFloatMap}].
[{prefix_method,var_prefix},
 {num_separators,3},
 {prefix_separator,47},
 {new_chainweights,[{tab1_ch1,100},
                    {tab1_ch2,100},
                    {tab1_ch3,100}]},
 {old_float_map,[]}]
--------------------------------------


Next, we use the chain definition list, NewCL, and the table options list, NewOpts, to create a ``local hash'' record. This record will contain all of the configuration information required to change a table's consistent hashing characteristics.

--------------------------------------
(hibari1@bb3)73> NewLH = brick_hash:chash_init(via_proplist, NewCL, NewOpts).
...lots of stuff omitted...
--------------------------------------

[[chash-migration-pre-check]]
We’re just one step away from changing the tab1 table. Before we change the table, however, we’d like to see how the table change will affect the data in the table. First, we add 1,000 keys to the tab1 table. Then we use the brick_simple:chash_migration_pre_check/2 function to tell us how many keys will move and to where.
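The commands used to load those keys are omitted from the transcript below. A hypothetical way to add 1,000 keys from the shell might look like this; the key and value formats are illustrative only, with varying key prefixes so that the keys spread across chains.

--------------------------------------
%% Hypothetical population of tab1 with 1,000 keys using brick_simple:set/3.
%% The key and value layouts are illustrative; varying the key prefix lets
%% the keys hash to different chains under the var_prefix method.
[brick_simple:set(tab1,
                  list_to_binary("/key" ++ integer_to_list(N) ++ "/data"),
                  <<"test value">>)
 || N <- lists:seq(1, 1000)].
--------------------------------------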

--------------------------------------
[ok,ok,ok,ok,ok,ok,ok,ok,ok,ok|...]

(hibari1@bb3)75> brick_simple:chash_migration_pre_check(tab1, NewLH).
[{keys_before,[{tab1_ch1,1001}]},
 {keys_keep,[{tab1_ch1,348}]},
 {keys_moving,[{tab1_ch2,315},{tab1_ch3,338}]},
 {keys_moving_where,[{tab1_ch1,[{tab1_ch2,315},
                                {tab1_ch3,338}]}]},
 {errors,[]}]
--------------------------------------


The output above shows us that of the 1,001 keys in the tab1 table, 348 will remain in the tab1_ch1 chain, 315 keys will move to the tab1_ch2 chain, and 338 keys will move to the tab1_ch3 chain. That looks like what we want, so let’s reconfigure the table and start the data migration.

--------------------------------------
brick_admin:start_migration(tab1, NewLH).
--------------------------------------

Immediately, we’ll see a bunch of application messages sent to the console as new activities start:

  • A migration monitoring process is started.
  • New brick processes are started.
  • New monitoring processes are started.
  • Data migrations start and finish.
  • The migration monitoring process exits.

--------------------------------------
=GMT INFO REPORT==== 20-Apr-2010::00:26:41 ===
progress: [{supervisor,{local,brick_mon_sup}},
           {started,
               [{pid,<0.2937.0>},
                {name,chmon_tab1_ch2},
                ...stuff omitted...

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:41 ===
Migration monitor: tab1: chains starting

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:41 ===
brick_admin: handle_cast: chain tab1_ch2 in unknown state

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:52 ===
Migration monitor: tab1: sweeps starting

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:54 ===
Migration number 1 finished

[...lines skipped...]
=GMT INFO REPORT==== 20-Apr-2010::00:26:57 ===
Clearing final migration state for table tab1
--------------------------------------

For the sake of demonstration, now let’s see what brick_simple:chash_migration_pre_check() would say if we were to migrate from three chains to four chains.

--------------------------------------
(hibari_dev@bb3)25> Opts3 = proplists:get_value(brick_options, TabInfo3).

(hibari_dev@bb3)26> GH3 = proplists:get_value(ghash, TabInfo3).

(hibari_dev@bb3)28> OldFloatMap = brick_hash:chash_extract_new_float_map(GH3).

(hibari_dev@bb3)31> NewOpts4 = [PrefixMethod, NumSep, PrefixSep,
                                {new_chainweights, NewCWs4},
                                {old_float_map, OldFloatMap}].

(hibari_dev@bb3)35> NewCL4 = [{tab1_ch1, [{tab1_ch1_b1, hibari1@bb3}]},
                              {tab1_ch2, [{tab1_ch2_b1, hibari1@bb3}]},
                              {tab1_ch3, [{tab1_ch3_b1, hibari1@bb3}]},
                              {tab1_ch4, [{tab1_ch4_b1, hibari1@bb3}]}].

(hibari_dev@bb3)36> NewLH4 = brick_hash:chash_init(via_proplist, NewCL4, NewOpts4).

(hibari_dev@bb3)37> brick_simple:chash_migration_pre_check(tab1, NewLH4).
[{keys_before,[{tab1_ch1,349},
               {tab1_ch2,315},
               {tab1_ch3,337}]},
 {keys_keep,[{tab1_ch1,250},{tab1_ch2,232},{tab1_ch3,232}]},
 {keys_moving,[{tab1_ch4,287}]},
 {keys_moving_where,[{tab1_ch1,[{tab1_ch4,99}]},
                     {tab1_ch2,[{tab1_ch4,83}]},
                     {tab1_ch3,[{tab1_ch4,105}]}]},
 {errors,[]}]
--------------------------------------


The output tells us that chain tab1_ch1 will lose 99 keys, tab1_ch2 will lose 83 keys, and tab1_ch3 will lose 105 keys. The final key distribution across the four chains would be 250, 232, 232, and 287 keys, respectively.

Hibari Community

TODO

Inside Hibari

Hibari Contributor’s Guide (Hibari v0.1.11)

DRAFT - IN PROGRESS

Date: 2015/03/22
Revision: 0.5.4

Copyright (C) 2005-2015 Hibari developers. All rights reserved.

Table of Contents

Misc