Metatron User Manual¶
Discovery Quick Guide¶
Metatron Discovery is an all-in-one solution that enables rapid loading, pre-processing, and analysis of large amounts of data. With Metatron Discovery, business users without technical knowledge can work directly with data and gain insights from rapid visualization.
You can perform data analysis with Metatron Discovery using the two methods below:
Method 1: Run Metatron Discovery demo site. Enter “metatron” as your ID and password.
Method 2: Download the single-mode Metatron Discovery to your local PC. Download is provided in three ways.
- Custom install: Download the source code from the Github repository, or directly run the build file.
- Virtual machine: Run the virtual machine image. This option is also available on Windows.
- Docker: Run the Docker image for a quick installation.
Do you see the screen below? Congratulations! You are now ready for quick and easy data analysis with Metatron Discovery.
For a quick start, follow the three-step tutorial below:
Step 1. Create a data source¶
The first step in data analysis is ingesting your data into the system. Metatron Discovery allows you to easily ingest various data sources.
The example in this tutorial shows you how to ingest data from your local directory. First, prepare data. An Excel file (.xls, .xlsx) or .csv file will suffice. This tutorial uses sales data. Download it from the link below:
Data sources can be viewed and ingested from Management > Data Storage > Data Source. To create a new data source, click the New button on the upper right of the data source list.
In this tutorial, click File to retrieve the data from your local directory. See Create a data source for details on creating a data source from other sources.
Drag and drop the data you wish to analyze, or retrieve it from the directory.
Drag your cursor over the sales data to view up to 100 rows of data with detection of the column delimiter and line separator. This data is properly displayed using the default delimiter and separator. Click Next.
While viewing the data, adjust the column types properly. This task is called data schema configuration.
Each column functions as a “dimension” or “measure.” See “Dimensions” and “Measures” for further details. In this data, the Discount, Profit, Quantity, Sales, DaystoShipActual, SalesForecast, DaystoShipScheduled, SalesperCustomer, and ProfitRatio columns must be converted into measures.
Next, the data types of columns must be adjusted properly. The string type is the default setting for dimensions, and the integer type for measures. While viewing the sample, change the data type settings properly. Below is a list of items to be modified in this data.
- Orderdate: Date/Time
- Discount: Decimal
- ShipDate: Date/Time (change the time format to yyyy. MM. dd. and click the checkbox to validate)
- SalesperCustomer: Decimal
- ProfitRatio: Decimal
- latitude: Latitude
- longitude: Longitude
Lastly, you should create a new column. Since we already have columns for latitude and longitude, we can create a point type column. Click the Add column button on the upper right. Select the latitude column for the Latitude column, and the longitude column for the Longitude column. Name the new column appropriately, and click Add. A new point type column is created!
Once you are done with schema configuration, click Next. If necessary, you can change the settings for ingestion into Druid. The default settings are sufficient for now.
Lastly, enter the Name and Description for the data source. Click Done to proceed to the data source details page.
In the data source details page, you can view the ingestion status in real time. The screen below appears after a few minutes, indicating success. A histogram is displayed. If you encounter an error while ingesting another data source, click Details to view the Druid ingestion log. Ingestion may be unsuccessful due to a duplicate column name or mismatch between column types and their data. Try ingestion again after addressing the issue.
To make the data source available to other users, check the checkbox next to Allow all workspaces to use this datasource under Publish. To make the data source available only to specific users, click Edit and select individual users’ or teams’ workspaces as desired.
In this example, we will choose Open Data to make it available to all users.
The ingested data can be viewed under the Data tab.
Congratulations! Now, it’s time to use the data source. Let’s proceed to the next step.
Step 2. Create a workbook¶
Do you have the data ready for analysis? Now, it’s time to create a workbook. The Workbook module supports the visualization of data. Click the Metatron Discovery logo on the upper left to enter your personal workspace.
Let’s begin by clicking the + Workbook button on the bottom right. Enter the name and description for the workbook. The checkbox is marked by default for you to create a dashboard once a workbook is created. A single workbook contains multiple dashboards, and each single dashboard contains multiple charts.
Proceed with creating a dashboard. A dashboard requires a data source for visualization. This data source can be either a single source, or joined data sources. See Create a dashboard for further details. This tutorial uses Sales Report, ingested previously in Step 1.
Click the + Add data source button for the data source selection popup. Search for Sales Report, or select the Show open data only checkbox and choose from the results.
Finally, enter the Name and Description for the dashboard.
You have created a dashboard in the workbook. Now, you can add widgets to the dashboard.
Let’s proceed to the next step.
Step 3. Organize a dashboard¶
The final step is to create chart widgets, text widgets, and filter widgets to fill the empty dashboard. The dashboard can be edited in the following order:
Using the Sales Report created earlier, let’s add a key performance indicator chart and a line chart to the dashboard.
In the empty dashboard, click the Chart button to create a chart.
Creating a key performance indicator chart¶
The first chart you will be creating is a key performance indicator (KPI) chart. The KPI chart is a simple yet powerful chart that displays the goals of an organization in an intuitive manner. The goal of our dashboard is to clearly present sales data. As such, the KPI chart should include total sales, sales forecast, and profit. What should we do? Simply click the three measure columns named “Sales,” “SalesForecast,” and “Profit” under the Data menu. This task is called pivoting. The pivoted columns are automatically aggregated and placed on shelves. Once columns are on shelves, suitable charts are recommended. How about clicking the recommended KPI chart?
The KPI chart is created as follows: To make it more presentable, let’s enter the chart properties menu on the right.
Click to enter the Common Setting panel and add an icon to each measure column.
Click to enter the Number Format panel and change the decimal place and abbreviation display.
The most important feature of the KPI chart is comparing present achievements with past performance. Click to enter the Set up secondary indicators panel. Set a secondary indicator, and check the % improvement in performance compared to the previous month. If you wish, you can emphasize the secondary indicator instead of the original indicator.
Click Save to display the chart in the dashboard.
Creating a line chart¶
Next, let’s create a line chart, the most basic type of chart. Shall we take a look at how sales and profit change over time? Again, click the Chart button to begin drawing a new chart. Click the OrderDate, Profit, and Sales columns to see how the values change over time. Click the recommended Line Chart.
A line chart is drawn. Open the chart properties panel, and change the line shape to “round.”
There is too much data as OrderDate is aggregated on an hourly basis. To view by month, go to the menu of the OrderDate column, and set Granularity as Month. The entire data is displayed now! Click Mini Map on the upper right to remove the mini map from the chart.
Click on the right menu, and change colors using the Color Setting panel.
Click Save, and drag and drop the chart to the desired position. Add information to the dashboard by adding a text widget. Click Done to finish dashboard editing.
In this tutorial, you learned how to draw two chart types. Using the interactive dashboard, you can select a chart or add filters to present data as desired. You can also modify, add, or delete charts if required.
Are you ready to learn more about Metatron Discovery?
Introduction of Metatron Discovery¶
Metatron Discovery is a solution that analyzes data ingested into the Metatron server cluster in a simple, sophisticated manner, and visualizes the results on the user’s PC in the form of charts and reports. As a web-based application, it is highly accessible and can be reached remotely from any PC.
This section introduces the technical background and structure of Metatron Discovery, and the Druid engine powering Metatron.
Overview of Metatron Discovery¶
Metatron Discovery is a 4th-generation OLAP-based business intelligence (BI) solution that combines OLAP, visualization, and machine learning technologies for even non-experts to quickly and easily derive higher-level value from data.
4th-generation BI solution¶
The figure below shows BI trends from the 1st to 4th generation.
The mainstream products in the current BI market belong to the 2nd and 3rd generations, and 4th generation products are beginning to come under the spotlight. As a 4th generation BI solution, Metatron Discovery supports self & ad-hoc data discovery and guarantees rapid response to big data.
Built on Big OLAP¶
Metatron Discovery combines dimension data of various kinds with large-sized fact data to produce a single big OLAP cube (data mart).
The use of a big OLAP cube offers the following advantages:
- Minimizes the number of data marts:
  - Lower ETL cost for data mart production.
  - Influence of structural change can be minimized.
- Satisfies diverse demands by saving all fact data.
- Distributed architecture allows storing of large-scale data and ensures fast data processing.
- With a dynamic schema approach, schema changes do not require redefining the schema.
- Data can be processed at the record level in real time as tables are saved with no data loss.
Architecture of Metatron Discovery¶
Metatron Discovery is an end-to-end solution that supports the entire process of data discovery, from preparation of large-scale data to data visualization and exploration and to advanced analytics. The figure below is a summary of Metatron’s architecture and key features.
Components of Metatron Discovery¶
Metatron Discovery performs analytics on its ingested data sources or other external data sources using various analytical tools and outputs analytical results in charts and reports. To utilize this system, you must understand its overall structure shown below:
Data Preparation¶
Data Preparation refines data from source data to be ingested into Metatron. See Data Preparation for details on data preparation.
Data Storage¶
Data Storage manages data ingested into the Metatron engine for analysis and visualization. See Data Management for details on data management.
Data analysis and visualization¶
Each module below allows users to perform visualization-based exploration and analysis of stored data.
Workspace¶
Workspace provides an interface to manage the workbooks, workbenches, and notebooks used in an organization according to user access. See Workspace for details on the use of the workspace.
Workbook, dashboard, chart¶
Workbook supports working on, sharing, and making a presentation with dashboards and charts using a PowerPoint-like interface. See Workbook for details on the workbook module.
Data Monitoring¶
This function monitors data use based on data query statistics and query logs. See Data Monitoring for details on the data monitoring functionality.
User permission and account management¶
You can add or delete users and manage user permissions.
Metatron engine: Druid¶
The development of information and communications technology has been accompanied by a rapid increase in the amount of data generated, highlighting the importance of efficient data collection, management, and utilization. However, RDBMS-based legacy tools are unable to process mass amounts of multidimensional data. This has led to the emergence of new methodologies and solutions aimed at satisfying the demand for big data.
Metamarkets, a technology startup based in Silicon Valley, launched a column-oriented distributed data store known as Druid in 2011, and open sourced it in October 2012. Many companies have turned to Druid for their backend technology because it offers various advantages, including fast and efficient data processing.
As a B2C telecommunications service provider, SK Telecom recognized the need to effectively manage and analyze the vast amounts of network data generated by its users every minute. Metatron, an end-to-end business intelligence solution with Druid as the underlying engine, was thus developed and launched in 2016.
The following sections discuss the features of Druid that make it suitable for time-series data processing, and introduce how they were adapted and improved by SK Telecom for Metatron.
Background of Druid development¶
Druid was originally designed to satisfy the following needs around ingesting and exploring large quantities of transactional events (log data):
- The developers wanted to be able to rapidly and arbitrarily slice and dice data and drill into that data effectively without any restrictions, along with sub-second queries over any arbitrary combination of dimensions. These capabilities were needed to allow users of their data dashboard to arbitrarily and interactively explore and visualize event streams.
- The developers wanted to be able to ingest events and make them explorable almost immediately after their occurrence. This was crucial to enable users to collect and analyze data in real time for timely situational assessments, predictions, and business decisions. Popular open source data warehousing systems such as Hadoop were unable to provide the sub-second data ingestion latencies required.
- The developers wanted to ensure multitenancy and high availability for their solution services. Their systems needed to be constantly up and be able to withstand all sorts of potential failures without going down or taking any downtime. Downtime is costly and many businesses cannot afford to wait if a system is unavailable in the face of software upgrades or network failure.
Druid features¶
Data table components¶
Data tables in Druid (called data sources) are collections of timestamped events designed for OLAP queries. A data source is composed of three distinct types of columns (here we use an example dataset from online advertising).
Source: http://druid.io
- Timestamp column: Druid treats timestamp separately in a data source because all its queries center around the time axis (If non-time series data is ingested in batch, all records are timestamped with the current time for use in Druid).
- Dimension columns: Dimensions are string attributes of an event, and the columns most commonly used in filtering the data. Four dimensions are involved in the example dataset: publisher, advertiser, gender, and country. They each represent an axis of the data chosen to slice across.
- Metric columns: Metrics are columns used in aggregations and computations. In the example, the metrics are clicks and price. Metrics are usually numeric values, and computations include operations such as count, sum, and mean (Metatron has extended the set of data types Druid supports). A sketch of a single event split into these column types follows below.
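To make the three column types concrete, here is a small illustrative sketch of a single event from the online-advertising example, split into a timestamp, dimensions, and metrics. The structure and values are purely illustrative (borrowed from the sample rows shown later), not Druid's internal representation.

```python
# One timestamped event from the online-advertising example, split into
# the three Druid column types described above (illustrative only).
event = {
    "timestamp": "2011-01-01T01:00:00Z",      # timestamp column: the time axis of every query
    "dimensions": {                           # string attributes used to filter and slice
        "publisher": "bieberfever.com",
        "advertiser": "google.com",
        "gender": "Male",
        "country": "USA",
    },
    "metrics": {                              # numeric values used in aggregations
        "clicks": 25,
        "price": 15.70,
    },
}
```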
Data ingestion¶
Druid supports real-time and batch ingestion.
One major characteristic of Druid is real-time ingestion, which is enabled by real-time nodes (For details, see Real-time nodes). Events ingested in real-time from a data stream get indexed in seconds to become queryable in the Druid cluster.
Data roll-up¶
The individual events in our example dataset are not very interesting because there may be trillions of such events. However, summarizations of this type of data by time interval can yield many useful insights. Druid summarizes this raw data when ingesting it using an optional process called “roll-up.” Below is an example of roll-up.
Source: Interactive Exploratory Analytics with Druid | DataEngConf SF ‘17
The table on the left lists the domain click events that occurred from 00:00:00 to 01:00:00 on January 1, 2011. Since individual events recorded in seconds do not have much significance from the analyst’s perspective, the data was compiled at a granularity of one hour. This results in the more meaningful table on the right, which shows the number of clicks by gender for the same time period.
In practice, rolling up data can dramatically reduce the size of data that needs to be stored (up to a factor of 100), thereby saving on storage resources and enabling faster queries.
But, as data is rolled up, individual events can no longer be queried; the rollup granularity is the minimum granularity you will be able to explore data at and events are floored to this granularity. The unit of granularity can be set as desired by users. If necessary, the roll-up process may be disabled to ingest every individual event.
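Conceptually, roll-up is a group-by over the truncated timestamp and the dimensions, with the metrics aggregated. The sketch below illustrates the idea in plain Python for an hourly granularity and a single clicks metric; it is a conceptual illustration, not Druid's actual implementation.

```python
from collections import defaultdict

# Raw click events: (timestamp, gender, clicks). Values are illustrative.
events = [
    ("2011-01-01T00:01:35Z", "Male",   1),
    ("2011-01-01T00:03:03Z", "Female", 1),
    ("2011-01-01T00:04:51Z", "Male",   1),
    ("2011-01-01T00:05:33Z", "Female", 1),
]

def floor_to_hour(ts):
    # Truncate an ISO-8601 timestamp to the hour (the roll-up granularity).
    return ts[:13] + ":00:00Z"

# Group by (hour, gender) and sum the clicks metric.
rolled_up = defaultdict(int)
for ts, gender, clicks in events:
    rolled_up[(floor_to_hour(ts), gender)] += clicks

for (hour, gender), clicks in sorted(rolled_up.items()):
    print(hour, gender, clicks)
# After roll-up only the hourly per-gender totals remain;
# the individual events can no longer be queried.
```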
Data sharding¶
A data source is a collection of timestamped events and partitioned into a set of shards. A shard is called a segment in Druid and each segment is typically 5–10 million rows. Druid partitions its data sources into well-defined time intervals, typically an hour or a day, and may further partition on values from other columns to achieve the desired segment size.
The example below shows a data table segmented by hour:
Segment sampleData_2011-01-01T01:00:00:00Z_2011-01-01T02:00:00:00Z_v1_0:
2011-01-01T01:00:00Z ultratrimfast.com google.com Male USA 1800 25 15.70
2011-01-01T01:00:00Z bieberfever.com google.com Male USA 2912 42 29.18
Segment sampleData_2011-01-01T02:00:00:00Z_2011-01-01T03:00:00:00Z_v1_0:
2011-01-01T02:00:00Z ultratrimfast.com google.com Male UK 1953 17 17.31
2011-01-01T02:00:00Z bieberfever.com google.com Male UK 3194 170 34.01
This segmentation by time can be achieved because every single event in a data source is timestamped.
Segments represent the fundamental storage unit in Druid and replication and distribution are done at a segment level. They are designed to be immutable, which means that once a segment is created, it cannot be edited. This ensures no contention between reads and writes. Druid segments are just designed to be read very fast.
In addition, this data segmentation is key to parallel processing in Druid’s distributed environment: As one CPU can scan one segment at a time, data partitioned into multiple segments can be scanned by multiple CPUs simultaneously in parallel, thereby ensuring fast query returns and stable load balancing.
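The sketch below illustrates the idea of time-based sharding and parallel scanning: rows are bucketed into hourly "segments", and each segment is scanned by a separate worker. It is a conceptual illustration only; Druid's real segments are immutable columnar files, not Python lists.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Timestamped rows (timestamp, domain, clicks); values are illustrative.
rows = [
    ("2011-01-01T01:00:00Z", "ultratrimfast.com", 25),
    ("2011-01-01T01:00:00Z", "bieberfever.com",   42),
    ("2011-01-01T02:00:00Z", "ultratrimfast.com", 17),
    ("2011-01-01T02:00:00Z", "bieberfever.com",  170),
]

# Partition the rows into hourly "segments" keyed by their time interval.
segments = defaultdict(list)
for ts, domain, clicks in rows:
    segments[ts[:13]].append(clicks)          # e.g. "2011-01-01T01", "2011-01-01T02"

def scan(segment_rows):
    # One worker scans one segment; here the "scan" is just summing a metric.
    return sum(segment_rows)

# Segments can be scanned in parallel because they are independent of each other.
with ThreadPoolExecutor() as pool:
    totals = dict(zip(segments.keys(), pool.map(scan, segments.values())))
print(totals)   # {'2011-01-01T01': 67, '2011-01-01T02': 187}
```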
Data storage format and indexing¶
The way Druid stores data contributes to its data structures highly optimized for analytic queries. This section uses the Druid table below as an example:
Source: Druid: A Real-time Analytical Data Store
Columnar storage and indexing¶
Druid is a column store, which means each individual column is stored separately. Given that Druid is best used for aggregating event streams, column storage allows for more efficient CPU usage as only the columns pertaining to a query are actually loaded and scanned in that query. In a row oriented data store, all columns associated with a row must be scanned as part of an aggregation. The additional scan time can introduce significant performance degradations. In the example above, the page, user, gender, and city columns only contain strings. Storing strings directly is unnecessarily costly; instead, they can be mapped into unique integer identifiers. For example,
Justin Bieber -> 0
Ke$ha -> 1
This mapping allows the page column to be represented as an integer array where the array indices correspond to the rows of the original dataset. For the page column, we can represent the unique pages as follows:
[0, 0, 1, 1]
Thus, strings are replaced by fixed-length integers in storage, which are much easier to compress. Druid indexes data on a per-shard (segment) level.
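A minimal sketch of the dictionary encoding described above: each distinct string gets an integer ID assigned in order of first appearance, and the column is stored as an array of those IDs. Purely illustrative.

```python
# Dictionary-encode the "page" column from the example table.
pages = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]

dictionary = {}   # string -> integer ID, assigned in order of first appearance
encoded = []
for value in pages:
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    encoded.append(dictionary[value])

print(dictionary)   # {'Justin Bieber': 0, 'Ke$ha': 1}
print(encoded)      # [0, 0, 1, 1] -- fixed-length integers that compress well
```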
Indices for filtering data¶
Druid creates additional lookup indices that facilitate filtering on string columns. Let us consider the above example table again. A query might be: “How many Wikipedia edits were done by users in San Francisco who are also male?” This example query involves two dimensions: City (San Francisco) and Gender (Male). For each dimension, a binary array is created where the array indices represent whether or not their corresponding rows match the query filter, as shown below:
San Francisco (City) -> rows [1] -> [1][0][0][0]
Male (Gender) -> rows [1, 2, 3, 4] -> [1][1][1][1]
And the query filter performs the AND operation between the two arrays:
[1][0][0][0] AND [1][1][1][1] = [1][0][0][0]
As a result, only row 1 is subject to scanning, which retrieves only the filtered rows and eliminates unnecessary workload. And these binary arrays are very easy to compress as well.
This lookup can be used for the OR operation too. If a query filters on San Francisco or Calgary, array indices will be for each dimension value:
San Francisco (City) -> rows [1] -> [1][0][0][0]
Calgary (City) -> rows [3] -> [0][0][1][0]
And then the OR operation is performed on the two arrays:
[1][0][0][0] OR [0][0][1][0] = [1][0][1][0]
Thus the query scans rows 1 and 3 only.
This approach of performing Boolean operations on large bitmap sets is commonly used in search engines.
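The bitmap lookups above can be sketched as follows: one bit array per dimension value, combined with AND/OR to decide which rows to scan. This is a conceptual illustration; production systems, Druid included, store such bitmaps in compressed form.

```python
# One bitmap per dimension value; bit i is 1 if row i has that value.
san_francisco = [1, 0, 0, 0]   # City = San Francisco -> row 1
calgary       = [0, 0, 1, 0]   # City = Calgary       -> row 3
male          = [1, 1, 1, 1]   # Gender = Male        -> rows 1 to 4

def bit_and(a, b):
    return [x & y for x, y in zip(a, b)]

def bit_or(a, b):
    return [x | y for x, y in zip(a, b)]

# "San Francisco AND Male": only row 1 needs to be scanned.
print(bit_and(san_francisco, male))     # [1, 0, 0, 0]

# "San Francisco OR Calgary": rows 1 and 3 need to be scanned.
print(bit_or(san_francisco, calgary))   # [1, 0, 1, 0]
```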
Query languages¶
Druid’s native query language is JSON over HTTP. Druid queries include:
- Group By
- Time-series roll-ups
- Arbitrary Boolean filters
- Sum, Min, Max, Avg and other aggregation functions
- Dimensional Search
In addition to these, query libraries in numerous languages, including SQL, have been developed and shared. An illustrative example of a native query issued over HTTP is shown below.
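As a sketch of what a native query looks like on the wire, the snippet below posts a groupBy query as JSON over HTTP from Python. The data source, column names, broker address, and port are assumptions for illustration; the field names follow Druid's documented native query format.

```python
import requests  # third-party HTTP client (pip install requests)

# A native Druid groupBy query expressed as JSON; names are illustrative.
query = {
    "queryType": "groupBy",
    "dataSource": "sampleData",
    "granularity": "hour",
    "dimensions": ["gender"],
    "aggregations": [
        {"type": "longSum", "name": "total_clicks", "fieldName": "clicks"},
    ],
    "intervals": ["2011-01-01T00:00:00Z/2011-01-02T00:00:00Z"],
}

# The broker address and port depend on the deployment.
response = requests.post("http://localhost:8082/druid/v2/", json=query)
print(response.json())
```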
Druid cluster architecture¶
A Druid cluster consists of different types of nodes, and each node type is designed to perform a specific set of tasks:
Real-time nodes¶
Real-time nodes function to ingest and query event streams. The nodes are only concerned with events for some small time range and periodically hand them off to the deep storage in the following steps:
Source: Druid: A Real-time Analytical Data Store
- Incoming events are indexed in memory and immediately become available for querying.
- The in-memory data is regularly persisted to disk and converted into an immutable, columnar storage format.
- The persisted data is loaded into off-heap memory so that it remains queryable.
- On a periodic basis, the persisted indexes are merged together to form a “segment” of data and then get handed off to deep storage.
In this way, all events ingested into real-time nodes, whether persisted or not, are present in memory (either on- or off-heap) and thus can be queried (queries hit both the in-memory and persisted indexes). This functionality of real-time nodes enables Druid to perform real-time data ingestion, meaning that events can be queried almost as soon as they occur. In addition, there is no data loss during these steps.
Real-time nodes announce their online state and the data they serve in Zookeeper (see External dependencies) for the purpose of coordination with the rest of the Druid cluster.
Historical nodes¶
Historical nodes function to load and serve the immutable blocks of data (segments) created by real-time nodes. These nodes download immutable segments locally from the deep storage and serve queries over those segments (e.g., data aggregation/filtering). The nodes are operationally simple based on a shared-nothing architecture; they have no single point of contention and simply load, drop, and serve segments as instructed by Zookeeper.
A historical node’s process of serving a query is as follows:
Source: Druid: A Real-time Analytical Data Store
Once a query is received, the historical node first checks a local cache that maintains information about what segments already exist on the node. If information about a segment in question is not present in the cache, the node downloads the segment from deep storage. Once the download completes, the segment is announced in Zookeeper to become queryable and the node performs the requested query on the segment.
Historical nodes can support read consistency because they only deal with immutable data. Immutable data blocks also enable a simple parallelization model: historical nodes can concurrently scan and aggregate immutable blocks without blocking.
Similar to real-time nodes, historical nodes announce their online state and the data they are serving in Zookeeper.
Broker nodes¶
Broker nodes understand the metadata published in Zookeeper about what segments are queryable and where those segments are located. Broker nodes route incoming queries such that the queries hit the right historical or real-time nodes. Broker nodes also merge partial results from historical and real-time nodes before returning a final consolidated result to the caller.
Broker nodes use a cache for resource efficiency as follows:
Source: Druid: A Real-time Analytical Data Store
Once a broker node receives a query involving a number of segments, it checks for segments already existing in the cache. For any segments absent in the cache, the broker node will forward the query to the correct historical and real-time nodes. Once historical nodes return their results, the broker will cache these results on a per-segment basis for future use. Real-time data is never cached and hence requests for real-time data will always be forwarded to real-time nodes. Since real-time data is perpetually changing, caching the results is unreliable.
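The per-segment caching behavior described above can be sketched roughly as follows. This is a simplified model, assuming results can be keyed by a segment identifier and merged by a simple reduction; it is not Druid's broker implementation.

```python
# Simplified model of broker-side, per-segment result caching.
cache = {}  # segment_id -> cached partial result

def query_segment(segment_id, is_realtime, forward):
    """Return the partial result for one segment.

    `forward` stands in for forwarding the query to the historical or
    real-time node serving the segment and returning its partial result.
    """
    if is_realtime:
        return forward(segment_id)               # real-time data is never cached
    if segment_id not in cache:
        cache[segment_id] = forward(segment_id)  # miss: forward, then cache
    return cache[segment_id]

def broker_query(segments, forward):
    # Gather per-segment partial results and merge them (here: a simple sum).
    return sum(query_segment(sid, rt, forward) for sid, rt in segments)

# Two historical segments and one real-time segment; the lambda is a stand-in
# for the partial result a node would return.
print(broker_query(
    [("seg-2011-01-01T01", False), ("seg-2011-01-01T02", False), ("seg-now", True)],
    forward=lambda segment_id: 100,
))
```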
Coordinator nodes¶
Coordinator nodes are primarily in charge of data management and distribution on historical nodes. The coordinator nodes determine which historical nodes perform queries on which segments and tell them to load new data, drop outdated data, replicate data, and move data to load balance. This enables fast, efficient, and stable data processing in a distributed group of historical nodes.
As with all Druid nodes, coordinator nodes maintain a Zookeeper connection for current cluster information. Coordinator nodes also maintain a connection to a MySQL database that contains additional operational parameters and configurations, including a rule table that governs how segments are created, destroyed, and replicated in the cluster.
Coordinator nodes undergo a leader-election process that determines a single node that runs the coordinator functionality. The remaining coordinator nodes act as redundant backups.
External dependencies¶
Druid has a couple of external dependencies for cluster operations.
- Zookeeper: Druid relies on Zookeeper for intra-cluster communication.
- Metadata storage: Druid relies on a metadata storage to store metadata about segments and configuration. MySQL and PostgreSQL are popular metadata stores for production.
- Deep storage: Deep storage acts as a permanent backup of segments. Services that create segments upload segments to deep storage and historical nodes download segments from deep storage. S3 and HDFS are popular deep storages.
High availability characteristics¶
Druid is designed to have no single point of failure. The different node types operate fairly independently of each other and there is minimal interaction among them. Hence, intra-cluster communication failures have minimal impact on data availability. To run a highly available Druid cluster, you should have at least two nodes of every node type running.
Architecture extensibility¶
Druid features a modular, extensible platform that allows various external modules to be added to its basic architecture. An example of how Druid’s architecture can be extended with modules is shown below:
Source: MetaMarkets - Introduction to Druid by Fangjin Yang
Metatron, an end-to-end business intelligence solution to be introduced in this paper, was also built by adding various modules to the Druid engine.
Druid performance assessments¶
With Druid being a data store that supports real-time data exploration, its quantitative assessments are focused on two key aspects:
- Query latency
- Ingestion latency
This is because the key to achieving “real-time” performance is to minimize the time spent on query processing and ingestion. A number of organizations and individuals, including the developers of Druid, have established benchmarks for Druid performance assessment based on the two key aspects, and shared how Druid compares to other database management systems.
Self-assessment by Druid developers¶
Druid: A Real-time Analytical Data Store[1] was published by the developers in 2014. Chapter 6. Performance contains details of Druid assessment, with a particular focus on query and ingestion latencies. The benchmarks of Druid performance are briefly introduced in the following sections.
Query latency¶
Regarding Druid’s query latency, the paper discusses two performance assessments: one was conducted on the eight data sources most queried at Metamarkets, and the other on TPC-H datasets. In this section, we review the latter assessment. The latencies of queries on the TPC-H datasets were measured and compared against MySQL, in the following cluster environment:
- Druid historical nodes: Amazon EC2 m3.2xlarge instance types (Intel® Xeon® E5-2680 v2 @ 2.80GHz)
- Druid broker nodes: c3.2xlarge instances (Intel® Xeon® E5-2670 v2 @ 2.50GHz)
- MySQL: Amazon RDS instance running on the same m3.2xlarge instance type
The figure below shows the query latencies resulting from Druid and MySQL when tested on the 1GB and 100GB TPC-H datasets:
Source: Druid: A Real-time Analytical Data Store
By showcasing these results, the paper suggests that Druid is capable of far faster query returns than legacy relational database systems.
The Druid paper also presents how faster query returns are achieved when multiple nodes are joined together in a cluster. When tested on the TPC-H 100 GB dataset, the performance difference between a single node (8 cores) and six-node cluster (48 cores) was as follows:
Source: Druid: A Real-time Analytical Data Store
It was observed that not all query types scale linearly, but the simpler aggregation queries do, showing a speed increase almost proportional to the number of cores (SK Telecom’s Metatron has made improvements to achieve much clearer linear scalability).
Ingestion latency¶
The paper also assessed Druid’s data ingestion latency on a production ingestion setup consisting of:
- 6 nodes, totalling 360GB of RAM and 96 cores (12 x Intel® Xeon® E5-2670).
A total of eight production data sources were selected for this assessment. The characteristics of each data source and their ingestion results are shown below. Note that in this setup, several other data sources were being ingested and many other Druid related ingestion tasks were running concurrently on the machines.
Source: Druid: A Real-time Analytical Data Store
Druid’s data ingestion latency is heavily dependent on the complexity of the dataset being ingested, but the latency measurements presented here are sufficient to demonstrate that Druid addresses the stated interactivity requirements well.
Druid performance assessment by SK Telecom¶
SK Telecom also measured the query and ingestion latencies of Druid as detailed below:
Query latency test¶
The conditions of query latency measurement were as follows:
- Data: TPC-H 100G dataset (900 million rows)
- Pre-aggregation granularity: day
- Servers: r3.4xlarge nodes, (2.5GHz * 16, 122G, 320G SSD) * 6
- No. of historical nodes: 6
- No. of broker nodes: 1
The query times for five queries of the TPC-H 100G dataset were as follows (the query times in Hive were also measured as a reference):
Source: SK Telecom T-DE WIKI Metatron Project
Note
The Hive benchmark performed poorly partly because some processes went through Thrift and the dataset was not partitioned.
Ingestion latency test¶
The conditions of ingestion latency measurement were as follows:
- Ingestion data size: 30 million rows/day, 10 columns
- Memory: 512 GB
- CPU: Intel (R) Xeon (R) Gold 5120 CPU @ 2.20 GHz (56 cores)
- No. of historical nodes: 100
- No. of broker nodes: 2
- Jobs performed by three out of ten middle-manager nodes
- Ingestion tool: Apache Kafka
Data ingestion was performed 100 times under the conditions specified above, and the average ingestion latency was 1.623439 seconds. As illustrated below, ingestion latency was computed as the sum of Kafka ingestion latency, Druid ingestion latency, and Druid query latency.
Source: SK Telecom T-DE WIKI Metatron Project
Druid assessments by third parties¶
Druid assessment by Outlyer¶
In a post[2] titled Top 10 Time Series Databases, published on the Outlyer blog on August 26, 2016, twenty open source time-series database systems were assessed. The author, Steven Acreman, ranked Druid 8th; his criteria and findings for Druid were as follows:
A summary of Druid assessment by Outlyer¶
- Write performance (single node): 25k metrics/sec (source: https://groups.google.com/forum/#!searchin/druid-user/benchmark%7Csort:relevance/druid-user/90BMCxz22Ko/73D8HidLCgAJ)
- Write performance (5-node cluster): 100k metrics/sec (calculated)
- Query performance: Moderate
- Maturity: Stable
- Pros: Good data model and a cool set of analytics features. Mostly designed for fast queries over large batch-loaded datasets, which it is great at.
- Cons: Painful to operate, not very fast write throughput. Real-time ingestion is tricky to set up.
Druid assessment by DB-Engines¶
DB-Engines[3], an online website, publishes a list of database management systems ranked by their current popularity every month. To measure the popularity of a system, it uses the following parameters:
- Number of mentions of the system on websites: It is measured as the number of results in queries of the search engines Google, Bing and Yandex.
- General interest in the system: For this measurement, the frequency of searches in Google Trends is used.
- Frequency of technical discussions about the system: The ranking list uses the number of related questions and the number of interested users on the well-known IT-related Q&A sites Stack Overflow and DBA Stack Exchange.
- Number of job offers, in which the system is mentioned: The ranking list uses the number of offers on the leading job search engines Indeed and Simply Hired.
- Number of profiles in professional networks, in which the system is mentioned: The ranking list uses the internationally most popular professional networks LinkedIn and Upwork.
- Relevance in social networks. The ranking list counts the number of Twitter tweets, in which the system is mentioned.
As of July 2018, Druid ranked 118th out of a total of 343 systems, and 7th out of 25 time-series database systems.
Comparison with Apache Spark¶
Comparing Druid with Apache Spark is meaningful because both technologies are emerging as next-generation solutions for large-scale analytics and their different advantages make them very complementary when combined together. Metatron also makes use of this combination: Druid as the data storage/processing engine and Spark as an advanced analytics module.
This section briefly introduces a report comparing the performance of Druid and Spark[4][5], published by Harish Butani, the founder of Sparkline Data Inc. Prior to the performance comparison, the report states that the two solutions are complementary rather than competitors.
Apache Spark characteristics¶
Apache Spark is an open-source cluster computing framework providing rich APIs in Java, Scala, Python, and R. Spark’s programming model is used to build analytical solutions that combine SQL, machine learning, and graph processing. Spark supports powerful functions to process large-scale and/or complex data manipulation workflows, but it isn’t necessarily optimized for interactive queries.
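As a flavor of the Spark programming model referenced above, the sketch below loads a flattened TPC-H file and runs a SQL aggregation with PySpark. The file path and column names are assumptions for illustration, not the benchmark's actual setup.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (assumes the pyspark package is installed).
spark = SparkSession.builder.appName("tpch-flat-demo").getOrCreate()

# Load a flattened (denormalized) TPC-H dataset; the path is hypothetical.
df = spark.read.parquet("/data/tpch_flat.parquet")
df.createOrReplaceTempView("tpch_flat")

# A simple Spark SQL aggregation, similar in spirit to the benchmark queries below.
spark.sql("""
    SELECT returnflag, COUNT(*) AS cnt, SUM(extendedprice) AS total_price
    FROM tpch_flat
    GROUP BY returnflag
""").show()

spark.stop()
```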
Dataset, queries, performance results¶
For the benchmark, the 10G TPC-H dataset was used. The 10G star schema was converted into a flattened (denormalized) transaction dataset and reorganized to be queryable in Druid and Spark. The sizes of the resulting datasets were:
- TPCH Flat TSV: 46.80GB
- Druid Index in HDFS: 17.04GB
- TPCH Flat Parquet: 11.38GB
- TPCH Flat Parquet Partition by Month: 11.56GB
And then, a number of queries were chosen to test the performance differences in various aspects as shown below:
Queries used for query latency comparison between Druid and Apache Spark¶

| Query | Interval | Filters | Group By | Aggregations |
|---|---|---|---|---|
| Basic Aggregation | None | None | ReturnFlag, LineStatus | Count(*), Sum(exdPrice), Avg(avlQty) |
| Ship Date Range | 1995-12/1997-09 | None | ReturnFlag, LineStatus | Count(*) |
| SubQry Nation, pType, ShpDt Range | 1995-12/1997-09 | P_Type, S_Nation + C_Nation | S_Nation | Count(*), Sum(exdPrice), Max(sCost), Avg(avlQty), Count(Distinct oKey) |
| TPCH Q1 | None | None | ReturnFlag, LineStatus | Count(*), Sum(exdPrice), Max(sCost), Avg(avlQty), Count(Distinct oKey) |
| TPCH Q3 | 1995-03-15- | O_Date, MktSegment | Okey, Odate, ShipPri | Sum(exdPrice) |
| TPCH Q5 | None | O_Date, Region | S_Nation | Sum(exdPrice) |
| TPCH Q7 | None | S_Nation + C_Nation | S_Nation, C_Nation, ShipDate.Year | Sum(exdPrice) |
| TPCH Q8 | None | Region, Type, O_Date | ODate.Year | Sum(exdPrice) |
The test results are as follows:
Source: Combining Druid and Spark: Interactive and Flexible Analytics at Scale
- The Filters + Ship Date query provides the greatest performance gain (over 50 times faster than Spark) when Druid is used. This is not surprising, as this query is a typical slice-and-dice query tailor-made for Druid. Along the same lines, TPCH Q7 shows a significant performance boost when running on Druid: milliseconds on Druid vs. tens of seconds on Spark.
- For TPCH Q3, Q5, and Q8 there is an improvement, but not to the same level as Q7. This is because the OrderDate predicate is translated to a JavaScript filter in Druid, which is significantly slower than a native Java filter.
- The Basic Aggregation and TPCH Q1 queries definitely show improvement. The Count-Distinct operation is translated to a cardinality aggregator in Druid, which is an approximate count. This is definitely an advantage for Druid, especially for large cardinality dimensions.
These results can vary with testing conditions, but one thing is clear: Queries that have time partitioning or dimensional predicates (like those commonly found in OLAP workflows) are significantly faster in Druid.
Implications¶
The testing results show that combining the analytic capabilities of Spark with the OLAP and low-latency capabilities of Druid can create great synergy. Druid ingests, explores, filters, and aggregates data efficiently and interactively, while the rich programming APIs of Spark enable in-depth analytics. By leveraging these different capabilities, we can build a more powerful, flexible, and extremely low-latency analytics solution.
References
[1] Fangjin Yang et al. (2014). Druid: A Real-time Analytical Data Store.
[2] Steven Acreman. (2016, Aug 26). Top 10 Time Series Databases. Retrieved from https://blog.outlyer.com/top10-open-source-time-series-databases.
[3] DB-Engines website. https://db-engines.com, July 2018.
[4] Harish Butani. (2018, Sep 18). Combining Druid and Spark: Interactive and Flexible Analytics at Scale. Retrieved from https://www.linkedin.com/pulse/combining-druid-spark-interactiveflexible-analytics-scale-butani.
[5] Harish Butani. (2015, Aug 28). TPCH Benchmark. Retrieved from https://github.com/SparklineData/spark-druid-olap/blob/master/docs/benchmark/BenchMarkDetails.pdf.
Metatron powered by Druid¶
As explained previously, Metatron employs Druid as its underlying engine and has made developments and improvements of Druid for its own uses. This section introduces the background, progress, and results of the adoption of Druid to Metatron.
Metatron development background and Druid integration¶
Metatron as a big data analytics solution¶
As the telecommunications service provider with the largest number of subscribers in South Korea, SK Telecom has exerted significant efforts to establish a stable network environment by using the mass amounts of network data logs generated by its users.
Due to the limitations of existing IT infrastructure in mass data processing, SK Telecom needed a big-data warehousing system (Apache Hadoop) and a big-data analytics solution compatible with the system. The company built its own Hadoop infrastructure to store mass amounts of data at low cost, but faced the following limitations:
- Network data generated by the countless users could not be analyzed in real time. Although it was possible to store and process big data, visualizations could be implemented only with a sampled subset of data in the same way as on legacy systems.
- Having different solutions and different managers support each stage of data analytics, such as ETL, DW, and BI, not only involved significant time and costs, but also resulted in poor data accessibility. An end-to-end solution was needed to analyze all stages at once in a simple and quick manner.
Why the Druid engine¶
Druid was the optimal engine for the Metatron solution because it fulfilled the aforementioned needs with the features below:
- Druid collects mass amounts of data in real time and indexes it into a queryable format, ensuring very fast data aggregations (a few seconds at the slowest) based on distributed processing.
- Druid’s OLAP time-series data format enables analysts to perform data exploration, filtering, and visualization as desired. Such free and flexible data exploration is essential for users to intuitively select the required data and determine correlations between different dimensions on it.
- Druid’s extensible architecture allows modules to be easily added.
Built on this architecture, Metatron is an end-to-end solution that embraces all layers of data collection, storage, processing, analysis, and visualization.
Druid engine integration¶
The Druid engine was integrated in Metatron as follows:
- With Druid as the basic engine for processing/analytics, the GUI was designed to support users in different professional domains and big-data analysts in data-related tasks such as data preparation, analytics, and visualization, as well as the sharing of results.
- IT administrators can manage/monitor data sources in Druid, and they can establish data preparation rules if data sources of higher quality are required.
Druid functions reinforced in Metatron¶
The open-source Druid, despite its strengths in data collection and processing, had to be improved for Metatron to properly function as an end-to-end solution. This section examines the limitations of the open-source Druid and the functions reinforced in Metatron.
Limitations of the open-source Druid¶
The open-source Druid has the following limitations:
- Since Druid does not yet have full support for joins, Metatron uses another SQL engine for data preparation.
- Druid supports only a subset of SQL queries.
- For a data lake, a traditional SQL engine is more appropriate.
- Druid cannot append to or update already indexed segments, except in some unusual cases.
- Nulls are not allowed.
- Filtering is not supported for metric columns.
- Linear scalability is not ensured; increasing the number of servers does not improve performance proportionally.
- Only a few data types are supported and it is difficult to add a new one.
- The management and monitoring tools are not powerful enough.
Druid functions reinforced in Metatron¶
The following functions of Druid were strengthened in Metatron:
Query functionality improvements
- Improved the functionality of the GroupBy query type.
- Slightly improved the functionality of other types of queries.
Features added
- Virtual columns (map, expression, etc.)
- New metric types (double, string, array, etc.)
- New expression functions
- Druid query results can be stored in HDFS or exported to a file.
- Queries for meta information and statistics
- New aggregate functions (variance, correlation, etc.)
- (Limited) Window functions (lead, lag, running aggregations, etc.)
- (Limited) Joins
- (Limited) Sub-queries
- Temporary data sources
- Complex queries (data source summarization, correlation between data sources, k-means, etc.)
- Custom column grouping
- Geographic information system (GIS) support
- Columnar histograms
- Bit-slice indexing
Index structure improvements
- Histograms for filtering on metrics
- Lucene format supported for text filtering
Connectability with other systems
- Hive storage handler
- Ingestion into Hive tables (based on connection with the Hive metastore)
- Ingestion into the ORC format
- RDBMS data ingestion via JDBC
- (Limited) SQL support backported
Miscellaneous improvements
- Bug fixes (50+) and minor improvements
Data Management¶
Data used by the three Discovery modules (workbook, notebook, and workbench) is prepared from various types of source data, engines, and storages. For these operations, data flows need to be standardized and managed, and different types of source data need to be linked.
Source data required for analysis and visualization is either ingested into the Metatron engine as a data source, or linked directly from an external database with a data connection. Data usage can be monitored and tracked using data monitoring.
Data Source¶
In Metatron Discovery, a “data source” refers to a Druid database table into which data is ingested. Based on these data sources, workbooks and notebooks perform data analytics and visualization.
The Data Source menu can be accessed under MANAGEMENT > Data Storage > Data Source on the left-hand panel of the main screen.
“Dimensions” and “Measures”¶
The columns of a data source linked to the dashboard are categorized into dimension and measure columns as explained below. To make full use of Discovery’s data analysis and visualization features, you must understand the concepts of dimensions and measures clearly.
Dimension column
A column containing categorical data with the following characteristics:
- The values in this type of column are not for aggregation but to be categorized (e.g.: Category, Region, Organization)
- By each of these categories, measure values are aggregated.
Measure columns
A column containing quantitative data with the following characteristics:
- The values in this type of column are subject to aggregation or contain quantitative information (e.g.: Sales)
- These values are aggregated based on dimensions.
Data source management home¶
On this home page, you can create, edit and view data sources.
Status: Filters the data source list by the availability of data sources stored in the data storage.
- Enable: Displays data sources that have been ingested and are available in workbooks or workbenches.
- Preparing: Displays new data sources whose ingestion is in progress.
- Failed: Displays data sources that have not been created properly.
- Disabled: Displays data sources that have been ingested but are not available because of an error in a certain Druid process.
Publish: Filters the data source list by public workspace.
- Open Data: Displays only data sources publicly available in all workspaces.
- Admin Workspace: Displays only data sources available in the administrator workspace.
- Shared workspaces: Displays only data sources available in the selected shared workspaces.
Creator: Filters the data source list by the user or group that created the data source.
Created time: Determines whether the data source list is filtered by created or updated time. You can choose from among All, Today, and Last 7 days or specify a time range to display only those entries that were created/updated within the range.
Search by name of data source: Searches the data source list for the name you type in.
Data source list: Lists data sources filtered by specified criteria. Click an entry in the list to view its details. (Refer to Data source details)
Delete: Hover the mouse over a data source to display a trash icon. Click the icon to delete the data source.
Data source details¶
Click a data source listed in the data source management home to view various attributes of that data source. The following subsections describe each area of the data source details. Note that a data source represents a Druid database table stored in Metatron and necessarily includes a timestamp column as a time-series table.
Common top area¶
- Name: Name of the data source. Click on it if you want to modify it.
- Description: Description of the data source. Click on it if you want to modify it.
- Last update: Shows who last updated the data source and when.
- Delete: Click this icon to display a menu that allows you to delete the data source.
- Tab selection: Each tab displays a specific set of attributes of the data source. Depending upon the type of data source, not all of the three tabs may be displayed. For details on each tab, refer to the relevant subsection below.
Data information area¶
This area displays basic information of the data source.
Data type: Type of the imported source data from which the data source has been created.
Status: Displays the availability of the data source.
Size: Displays the size of the data source.
Duration: Displays the time range of the timestamps included in the data source.
Timestamp setting: Displays the granularities defined when the data source was created.
- Query Granularity: Defines the minimum time period by which data is queried. This ensures faster returns by aggregating data per granularity interval.
- Segment Granularity: In Druid, a data source is stored into multiple segments to be processed over multiple nodes in the distributed cluster environment. This granularity setting defines the time intervals into which the data source is partitioned.
- Histogram: A graph displaying the size of the data stored within each time interval in Kbytes. This histogram can be rendered because the Druid engine timestamps every table record.
Publish area¶
In this area, you can check and set which workspaces have access to the data source.
- Allow all workspaces to use this data source: Select this check box to make the data source available in all workspaces.
- Edit: Used to allow specific workspaces to access the data source. This button will disappear if the data source is set as open data.
- Number of shared workspaces: Displays how many workspaces have access to the data source.
Change data schema¶
The top section of the column details tab provides a user interface to filter columns by the criteria you define. Columns that meet the criteria are displayed on the left. You can also edit column settings.
Column view/settings
- Search data: Searches for columns by the column name you type in.
- Role: Displays all, dimension, or measure columns.
- Type: Displays the columns whose data type is selected.
- View all: Clears all filter settings in the Search data, Role, and Type options and returns to view all columns.
- Configure schema: Click this button to prompt a window to edit the current column settings.
- Column list: Lists table columns.
- Column information: Displays attributes of the selected column.
- Column settings: Displays the metadata of the selected column.
- Statistics: Displays the row count and other statistical values of the selected column.
Configure the schema
Provides a user interface for editing the name and type of columns.
- Role: Displays whether the column is a dimension or measure.
- Name: Displays the actual name of the column.
- Logical name: Allows you to edit the logical name of the column displayed in the system.
- Type: Allows you to edit the logical type (character/integer/date, etc.) of the column.
- Format: Allows you to edit the display format of the column in the case of the column being a timestamp type.
- Description: Allows you to add a detailed description of the column.
Analyze data statistics¶
The Monitoring tab reports the usage of the data source.
Change of transaction
Displays the trend of data source transactions over time.
Changes of data size
Displays the trend of the data source size over time.
Query distribution (during last one week)
- Query distribution by user (during last one week): Displays a pie chart of query percentages by user for the past week.
- Query distribution by elapsed time (during last one week): Displays a pie chart of query percentages by execution time for the past week.
Query log
Used to view a detailed history of each performed query.
- Date: Set a time range to display only those queries that were last executed within this time range.
- Query type: Filters the performed queries by type.
- Status: Displays all, succeeded, or failed queries.
- Query list: Lists queries filtered by specified criteria.
- Detail: Click on it to view the query statement.
Create a data source¶
This section explains the process of ingesting various types of source data into the Metatron engine and converting them into data sources.
To create a data source, click the + New button at the top right of the Data Source home screen.
Then, select the type of source data.
- File: Creates a data source from a file stored on your local PC (for details, refer to Create a data source from a file).
- Database: Creates a data source from an external database (for details, refer to Create a data source from a database).
- Staging DB: Creates a data source from Metatron’s internal Hive database (for details, refer to Create a data source from a staging database).
- Stream: This function is not currently supported.
- Data Snapshot: This function is not currently supported.
- Metatron Engine: Migrates a data source stored in a previous Metatron version (for details, refer to Add a data source with the Metatron engine).
Create a data source from a file¶
Creates a data source from a file stored on your local PC.
On the source data type selection page, select File.
Select a file to be used as a data source from your local PC. You can either click the Import button and select the file, or drag and drop a file to the box. Once a file is selected, click Next.
From the file, select the sheet to be included in the data source.
Note
If the “No preview data” message is shown in spite of there being data, check whether the Column delimiter and Line Separator have been configured correctly. In this example, the Line Separator must be set to "\r", the carriage return character used by MS Windows.
- File name: Name of the imported file. You can replace it with another file.
- File sheet list: Displays the sheets included in the imported file. Select the sheet from which you want to create a data source.
- File sheet name: Name of the currently selected sheet.
- Size: Size of the imported file.
- Column: Number of columns in the imported file.
- Row: Displayed number of rows and total number of rows in the imported file. Enter the number of rows to be displayed on the page.
- Type: Displays how many data types are recognized from the columns. The data type of each column can be modified later.
- Use the first row as the head column: Select the check box to use the first row of the file as column headers. If you don’t select it, a new row is inserted as a column header row.
Configure the schema of the data source.
Search by column header: Searches the imported file for columns by name.
Role: Displays all, dimension, or measure columns from the imported file.
Recommended filters: Displays columns to which a top-priority filter is applied.
Type: Filters the columns in the imported file by field type.
Column list section: Lists columns filtered by specified criteria. Once you have selected columns, a panel appears at the bottom of the screen. After selecting your desired batch action in the panel, click Apply to perform the batch action on the selected columns.
Individual column settings section: This area is used to set the attributes of a column selected from the column list. Missing is used to set nulls in the column.
- Replace with: Replaces the nulls with the value typed in.
- Discard: Discards the nulls.
- Do not set: Leaves the nulls as nulls. However, the nulls in the timestamp column are mandatorily discarded.
Timestamp setting: Determines how to timestamp each row. You can either designate an existing time-type column as a timestamp column, or create a new time-type column whose values are all timestamped with the current time.
Note
Metatron Druid is a time-series engine that requires a timestamp for each row when a data source is created.
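If your source file has no time column at all, you can rely on the “current time” option above, or add a timestamp column to the file before uploading it. A minimal pandas sketch, assuming a placeholder sales.csv:

```python
import pandas as pd

df = pd.read_csv("sales.csv")              # "sales.csv" is a placeholder path
df["ingested_at"] = pd.Timestamp.now()     # same idea as the current-time option in the UI
df.to_csv("sales_with_timestamp.csv", index=False)
```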
Configure data source ingestion and click Next.
![]()
Segment Granularity: In Druid, a data source is stored into multiple segments to be processed over multiple nodes in the distributed cluster environment. This granularity setting defines the time intervals into which the data source is partitioned.
Query Granularity: Defines the minimum time period by which data is queried. This ensures faster returns by aggregating data per granularity interval.
Rollup: “Data rollup” summarizes data based on its dimensions (for details on the concept of data rollup, refer to Data roll-up). A summarization rule might be summing up all values in each column or applying a set of expressions such as profit = sales - expenses.
Advanced settings: Configures how data is ingested. Type the settings into the text box in JSON format. For example:
{maxRowsInMemory : 75000, maxOccupationInMemory : -1, maxShardLength : -2147483648, leaveIntermediate : false, cleanupOnFailure : true, overwriteFiles : false, ignoreInvalidRows : false, assumeTimeSorted : false}
Confirm the information about the data set from the imported file, enter the Name and Description, and click Done to create a data source. It may take a few seconds or minutes depending on the amount of data, as the source data is ingested into the internal Metatron engine (Druid).
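If an ingestion fails because of malformed advanced settings, it can help to validate the text as JSON before pasting it. The sketch below reproduces the default keys shown above with standard JSON quoting (whether your Metatron version also accepts unquoted keys is not assumed here); the values are illustrative:

```python
import json

advanced_settings = {
    "maxRowsInMemory": 75000,
    "maxOccupationInMemory": -1,
    "maxShardLength": -2147483648,
    "leaveIntermediate": False,
    "cleanupOnFailure": True,
    "overwriteFiles": False,
    "ignoreInvalidRows": False,
    "assumeTimeSorted": False,
}

text = json.dumps(advanced_settings, indent=2)
json.loads(text)   # raises an error if the text is not valid JSON
print(text)        # paste the printed JSON into the Advanced settings box
```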
![]()
After data ingestion is complete, you can check the status. In the example below, the status is set to ENABLED and a histogram is displayed.
![]()
In the Data tab, you can check the ingested data in the form of a table.
![]()
On the Data Source management home screen, you will find a newly-created data source. While data is being ingested, the status is displayed as Disabled as shown below; the status changes to Enabled once ingestion is complete. After that, you can use the data source.
![]()
Create a data source from a database¶
Creates a data source from an external database.
On the source data type selection page, select Database.
Enter the information to connect the database.
![]()
Ingestion type: Select how to ingest data into the data source.
- Ingested data: Displays data sources that contain data ingested into the Metatron storage.
- Linked data: Displays data sources that load data from linked databases whenever necessary.
Load a data connection: Automatically loads access information for a database that is already registered as a data connection. However, you must verify the connection by clicking the Validation check button.
DB type: Select the type of the database to be connected.
Host: Enter the hostname to connect to the database.
Port: Enter the port to connect to the database.
User name: Enter the username of the database.
Password: Enter the password of the database.
Validation check: Once you fill out all fields, the Test button becomes active. Click it to verify that the connection is valid; the result appears below the button.
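The Test button performs the same kind of check you could run yourself outside Metatron. As a rough sketch only, here is a connection test for a MySQL database using the third-party pymysql driver; the host, port, and credentials are placeholders:

```python
import pymysql  # third-party driver: pip install pymysql

try:
    conn = pymysql.connect(host="db.example.com", port=3306,
                           user="analyst", password="secret",
                           connect_timeout=5)
    print("Connection valid")
    conn.close()
except pymysql.MySQLError as exc:
    print("Connection failed:", exc)
```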
Select data. You can either select a table from the connected database, or write a query yourself.
![]()
- Table: Select a database and a table to display the table’s data. Once the data being ingested has been displayed, confirm the data and click Next.
- Query: Write a query to import the data you want, and click Run to display the data in the lower section. Confirm the data and click Next.
The rest of the process is identical to Create a data source from a file. However, when creating a data source from a database, you must configure additional ingestion settings as follows.
![]()
Ingest once: Ingests the data currently stored in the database one time only. If you select Limited record count, you can specify how many rows are ingested, starting from the first row.
![]()
Ingest periodically: Saves data on a regular basis.
![]()
Create a data source from a staging database¶
Creates a data source from Metatron’s internal Hive database.
On the source data type selection page, select Staging DB.
Once you select the database and its table to connect, the data is displayed.
![]()
The rest of the process is identical to Create a data source from a database.
![]()
Add a data source with the Metatron engine¶
Migrates a data source stored in a previous Metatron version.
On the source data type selection page, select Metatron Engine.
When data sources created in a previous version of Metatron are listed on the left as shown below, select the check boxes of the data sources you want to migrate to the current version.
![]()
Click Done to migrate the selected data sources.
![]()
Data Connection¶
Metatron Discovery can connect to an external database directly. To connect to an external database, you must create and manage a data connection containing the access information to that database. By registering such a data connection, you don’t need to enter the access information each time you connect to the same database.
The Data Connection menu can be accessed under MANAGEMENT > Data Storage > Data Connection on the left-hand panel of the main screen.
Data connection management home¶
On the Data Connection home page, you can create, edit and view database connections.
![]()
- Publish: Filter the data connection list by public workspace.
- Creator: Filter the data connection list by creator.
- DB type: Filter the data connection list by database type (MySQL, PostgreSQL, Hive, or Presto).
- Security: Filter the data connection list by security type (Always connect, connect by user’s account, or connect with ID and password).
- Created time: Filter the data connection list by time of creation (Today, Last 7 days, or Between).
- Search: Search the data connection list by data connection name.
- Number of data connections: Displays how many data connections are returned in the list.
- New: Click on it to create a new data connection.
- Delete: Hover the mouse over a data connection to display a recycle bin icon. Click the icon to delete the data connection.
Create a data connection¶
On the Create data connection screen, enter the required information to create a connection.
![]()
DB type: Four database types are currently supported (MySQL, PostgreSQL, Hive, and Presto).
Host: Enter the hostname to connect to the database.
Port: Enter the port to connect to the database.
URL only: Enter a database URL instead of a host and port.
User name: Enter the username of the database.
Password: Enter the password of the database.
Security: Set the type of security to be applied while using the data connection.
- Always connect: Logs in using the account information the user has entered to create a new data connection.
- Connect by user’s account: Logs in using the account information registered in Metatron Discovery.
- Connect with ID and password: Requires the user to enter the account information every time the data connection is used.
Validation check: Checks whether the connection information entered is valid; the result appears below the button.
Advanced settings: You can add a custom property key and value as options.
Publish: Set which workspaces have access to the data connection.
- Allow all workspaces to use this data connection: Select this check box to make the data connection available in all workspaces.
- Edit: Used to allow specific workspaces to access the data connection. This button will disappear if the data connection is set as open data.
- Number of shared workspaces: Displays how many workspaces have access to the data connection.
Data Monitoring¶
Data monitoring supports monitoring the logs of all queries submitted by users in Metatron Workbench to the staging database (internal Hive database) and external databases connected to Metatron.
The Data Monitoring menu can be accessed under MANAGEMENT > Data Storage > Data Monitoring on the left-hand panel of the main screen.
Log Statistics¶
This page collects and reports various statistics related to the performance of queries in Metatron Discovery. You can view the following nine types of basic statistics.
![]()
- Query success/failure rate: Displays the daily success/failure rates of queries performed in Metatron.
- Query frequency by user: Graph indicating how many queries were performed by each user. Click a bar to view the job log for the user.
- In order of longest: Displays the performed queries in the order of the longest running time.
- Amount of scan data: Displays the performed queries in the order of the highest amount of scanned data.
- Frequency of successful queries: Displays the performed queries in the order of the highest frequency of success.
- Frequency of failed queries: Displays the performed queries in the order of the highest frequency of failure.
- Total memory usage: Displays the performed queries in the order of the largest memory usage in total.
- Total CPU usage: Displays the performed queries in the order of the largest CPU usage in total.
- Resource usage by queue: Displays the resource usage in each YARN queue in the Hadoop environment.
Job Log¶
This page reports the history of all queries performed in Metatron. You can easily view previous jobs by searching the history of queries with your customized filters. The following are the filters applicable to job searching.
![]()
- Status: Filters queries by whether they were successful or failed.
- Limited elapsed time: Filters queries by long running time. You can set a reference time for this filtering.
- Performed start time: Determines a time range by which to filter queries. This time range is based on when each query started running.
- Search by job or application: Searches the query history by query statement or application ID.
- Number of entries: Displays how many queries are returned in the list.
- Job list: Lists queries filtered by specified criteria. Click an entry in the list to view its details.
Query details¶
Click a query listed in the job log home to view details on that query. The following information can be viewed in the details page.
![]()
- Status: Displays whether the query was successful or failed.
- Job name: Statement used to perform the query.
- Start time: Time when the query started running.
- Elapsed time: Time taken to perform the query.
- User: User ID who performed the query.
- Connection: For a query performed in a workbench, the connection information of the database is displayed.
- Recent history of the same connection: For a query performed in a workbench, the latest five queries performed in the database and their results are displayed. Click Detail to pop up a window showing the query statement.
- Plan: Displays the query execution plan.
Workspace¶

A workspace stores Metatron Discovery’s analytics entities such as workbooks, notebooks, and workbenches. There are two types of workspaces: personal and shared workspaces.
- Personal workspace: A private workspace assigned to each Discovery member. It is accessible only to the owner.
- Shared workspace: A public workspace shared by multiple users. It is used for users to share analytics processes and results with each other. The owner or administrator of a shared workspace can grant various levels of access to Discovery members.
This chapter introduces the workspace home page and UI, and then explains how to use shared workspaces.
Workspace home¶
On the workspace home page, you can manage the Metatron Discovery entities (workbooks, notebooks and workbenches) contained in the workspace.
Composition of the workspace home¶
The overall composition of the workspace home is as follows:
![]()
Main menu button: Click this button to open a panel to access another workspace.
Workspace information: Displays the name and description of the workspace. If the logged-in user owns the workspace, an Owner icon will be displayed next to the name of the workspace.
Registered entities: Displays the number of entities registered in the workspace by entity type.
Data source: Displays the number of data sources used in the workspace. Click this area to show a list of these data sources.
Workspace list: Click this button to show a list of shared workspaces. (See Shared workspace list for how to handle it.)
Creation information: Displays who created the workspace and when.
More: Edit the settings of the workspace.
- Edit the name and description: Edits the name and description of the workspace.
- Set shared member & group: Sets the users and groups who can access the workspace. (See Set access permissions for a shared workspace for details.)
- Set notebook server: Sets access information for external analytics tool servers used by the Notebook module.
- Set permission schema: Sets the access permission of each user role for the workspace. (See Set access permissions for a shared workspace for details.)
- Change owner: Changes the owner of the workspace.
- Delete workspace: Deletes the workspace.
Path in the workspace: Displays the current location in the workspace. Click on a parent folder listed in the path to move to that folder.
Create a folder: Click on it to create a new folder in the current location.
Filter/sort the entity list:
- Search: Searches for an entity or folder in the workspace by name.
- Entity type: Displays only your selected type of entities among workbooks, notebooks, and workbenches.
- Sort: Sorts folders and entities by their name or when they were last updated.
- View type: Select either the grid view or list view as the format of how the entities are listed in the workspace.
Folder list: Displays folders that meet search criteria in the current location. Click one to enter that folder. (For details on individual folders, see Folder items)
Entity list: Displays entities that meet search or sorting criteria in the current location. Click an entity to enter its home. (For details on individual entities, see Entity items)
Select/clone/move/delete entity: Select all entities, or clone, move or delete an entity. (See Select/clone/move/delete folder and entity for details.)
Create an entity: Buttons used to create a specific type of entity in the workspace. (For details, see Create a workbook, Create a notebook, and Create a workbench, respectively.)
Folder items¶
When the mouse cursor is over a folder, it is shown as follows:
![]()
- Check box: Used to select the folder. You can clone, move or delete the selected folder.
- Name: Name of the folder.
- Edit: Click on it to modify the name of the folder. This button is displayed only when you hover the mouse over the folder item.
- Delete: Click on it to delete the folder. This button is displayed only when you hover the mouse over the folder item.
Entity items¶
When the mouse cursor is over an entity, it is shown as follows:
![]()
Check box: Used to select the entity. You can clone, move or delete the selected entity.
Entity type: Displays the type of the entity (workbook/notebook/workbench).
Delete: Click on it to delete the entity. This button is displayed only when you hover the mouse over the entity item.
Name: Name of the entity.
Last updated: Displays when the entity was last updated.
- Number of data sources/dashboards: This area is exclusive to the workbook type.
- The number next to the data source icon refers to how many data sources are connected to the workbook.
- The number next to the dashboard icon refers to how many dashboards are registered in the workbook.
Select/clone/move/delete folder and entity¶
You can clone, move or delete folders and entities in the workspace. Once you select a folder or entity, the clone, move, and delete buttons in the lower-left corner of the workspace home become active.
![]()
- Select all: Selects all items in the current folder and entity list.
- Clone workbook: This is exclusive for the workbook type. Click this button to clone the selected workbooks.
- Move selections: Moves the selected folders and entities. Workbooks can be moved to another workspace, and other types of items can be moved to another folder in the same workspace. However, it is impossible to move selections when workbooks and other types of entities are selected together.
- Delete: Deletes the selected folders and entities.
Shared workspace¶
A shared workspace is designed for access and use by multiple users. The following subsections describe how to view and create shared workspaces, and explain “permission schema,” which sets which users or groups are allowed to access shared workspaces.
Workbook¶

Workbook is a data visualization module powered by the Metatron Druid engine. As shown in the diagram above, each workbook (a standalone report) consists of multiple dashboards, while each dashboard consists of various charts showing a visualization of source data analysis.
The main features of Workbook are as follows:
- Fast and flexible data analytics over time-series multidimensional data sources.
- Dashboards contain a variety of visualized charts and texts to be compiled into a report for presentations.
- Frequently used algorithms such as clustering, prediction lines, and trend lines can be implemented through a GUI (graphical user interface).
This chapter consists of:
Create a workbook¶
In Metatron Discovery, a workbook functions as a standalone data analytics report. Once a workbook is created, you can store a number of dashboard slides in the workbook and present them in the proper order.
A workbook is created as follows:
Click the + Workbook button at the bottom of the workspace to move to the workbook creation page.
![]()
Enter a name (required) and description for the workbook to be created and click Done. If you select Continue to create a dashboard of a new workbook, you’ll proceed directly to the Create Dashboard page. This option is provided because a workbook cannot work without dashboards in it.
![]()
After clicking the “+ Add Data Source” button in the middle of the screen, select a data source to create a dashboard. For details on how to create a dashboard, refer to Create a dashboard.
![]()
![]()
You can check the new workbook in the workspace home as shown below. Click the workbook to enter it.
Dashboard¶
Stored in a workbook, a dashboard provides functions to analyze and visualize its connected data source as needed. Therefore, an important step to create a dashboard is connecting to a data source.
You can visualize analyses of various data sources into charts and texts; those visualizations are customizable using pivoting, chart mapping, and filtering.
Create a dashboard¶
A dashboard is created as follows:
Click + Add data source on the workbook screen.
![]()
From the list of data sources accessible to the workspace, select the master data sources to which you want to connect the dashboard. In a subsequent step, you can select additional data sources to be joined to these master data sources selected here.
![]()
- Search by data source name: Search for a data source accessible to the workspace by name.
- Show open data only: Displays only those designated as “open data sources.”
- Type: Displays only those data sources that are the connection or collection type.
- Data source list: Lists data sources filtered by specified criteria.
- Data source information: Displays brief information of the data source selected in the list.
If you have selected more than one data source, you can associate them by dragging one data source to another. Associated data sources can be filtered by each other. If you do not want data source association, simply click Done.
![]()
Once you drag a data source onto another one, a new window will pop up prompting you to configure the data source association. Select a column in each table as an association key by which to filter the other data source, and click Done.
![]()
Once you have finished setting up associations between the master data sources, click Done.
![]()
Re-configure master data source associations or add other data sources to be joined to the top data source selected above as described below:
![]()
Master data source association view
- Add data source: Click on it to add a new master data source.
- Edit association: Click on it to edit an established data source association.
Settings panel for individual master data sources (click one of the ovals corresponding to a master data source on the diagram to open it)
- Data preview: Displays the data table resulting from data source joins.
- Manage schema: Allows you to manage joins to the selected data source (for a detailed procedure, refer to the next step).
- Unlink: Click on it to remove the selected data source.
- Close: Click on it to close the panel.
To join one of the master data sources to other data sources, click the corresponding oval on the diagram → click the Manage Schema tab on the panel at the bottom → click + Add a data source to join.
![]()
Refer to the description below to set up data joins.
![]()
Master data source: Displays information on the master data source to which you want to join another data source.
Datasource to join: Select a data source to be joined to the master data source.
Add to join keys: A join key defines the join relationship between a column of the master data source and a column of the data source to be joined. Select a column to join from each data source, and click this button to add a new join key. The two columns must be of the same data type.
Join type: Select how to join and transform a data source. To help you understand, each join type is explained below using the following tables as an example.
Master data source

| Product name (join key) | Price  |
|-------------------------|--------|
| A                       | $22.11 |
| B                       | $9.23  |
| C                       | $8.99  |
| D                       | $10.10 |

Data source to be joined

| Product name (join key) | Sales |
|-------------------------|-------|
| B                       | 100   |
| D                       | 200   |
| E                       | 50    |

Inner: Imports those records of each data source whose join key column values are also present in the other data source’s join key column, joins them, and stores the joined records in the resulting table (intersection between the two data sources).

| Product name (join key) | Price  | Sales |
|-------------------------|--------|-------|
| B                       | $9.23  | 100   |
| D                       | $10.10 | 200   |

Left: Imports those records of the right data source (data source to be joined) whose join key column values are also present in the join key column of the left data source (master data source), joins them to the left data source records, and stores the joined records in the resulting table. Records from the right data source whose join key column values are not present in the left data source are discarded.

| Product name (join key) | Price  | Sales |
|-------------------------|--------|-------|
| A                       | $22.11 | null  |
| B                       | $9.23  | 100   |
| C                       | $8.99  | null  |
| D                       | $10.10 | 200   |

Right: Imports those records of the left data source (master data source) whose join key column values are also present in the join key column of the right data source (data source to be joined), joins them to the right data source records, and stores the joined records in the resulting table. Records from the left data source whose join key column values are not present in the right data source are discarded.

| Product name (join key) | Price  | Sales |
|-------------------------|--------|-------|
| B                       | $9.23  | 100   |
| D                       | $10.10 | 200   |
| E                       | null   | 50    |

Full Outer: Imports all records from both data sources, joins them, and stores the joined records in the resulting table (union between the two data sources).

| Product name (join key) | Price  | Sales |
|-------------------------|--------|-------|
| A                       | $22.11 | null  |
| B                       | $9.23  | 100   |
| C                       | $8.99  | null  |
| D                       | $10.10 | 200   |
| E                       | null   | 50    |

Preview results: Displays the data table resulting from data source joins.
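For readers who find code easier to follow than diagrams, the same four join types can be reproduced on the example tables above with pandas (pandas is used here only to illustrate the join semantics; Metatron performs the join itself during ingestion):

```python
import pandas as pd

master = pd.DataFrame({"Product name": ["A", "B", "C", "D"],
                       "Price": [22.11, 9.23, 8.99, 10.10]})
joined = pd.DataFrame({"Product name": ["B", "D", "E"],
                       "Sales": [100, 200, 50]})

# "outer" corresponds to Full Outer above; missing values appear as NaN.
for how in ["inner", "left", "right", "outer"]:
    print(f"--- {how} join ---")
    print(master.merge(joined, on="Product name", how=how))
```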
Confirm the information on the imported data source, enter the Name and Description, and click Done to create a new dashboard.
![]()
The new dashboard will be added to the workbook home. Click the dashboard to display its contents.
![]()
Change dashboard size and layout¶
Click Edit Dashboard on the basic dashboard page to go to a page for editing the configuration of the dashboard. In this page, you can add a widget, edit the dashboard, set the hierarchy and change the layout.
Dashboard widget arrangement settings¶
![]()
- Change widget location: Drag the title of a widget to move the widget.
- Adjust widget width: Drag the boundary between widgets to adjust their widths.
- Add a widget to the display area: Drag a widget from the widget list on the right panel to the left widget display area to add the widget to the display area.
- Delete a widget from the display area: Click the X button on a widget shown in the widget display area to delete the widget from the display area.
Chart widget panel¶
On the chart widget panel, you can add/edit/delete a chart in the dashboard.
![]()
- Number of chart widgets: Displays how many chart widgets are registered in the dashboard.
- Add a chart widget: Click on it to create a new chart widget in the dashboard.
- Chart widget list: Lists chart widgets registered in the dashboard. Hover the mouse over a widget to display the edit and delete icons. Drag a widget to the widget display area to display the widget in the display area.
- Set chart hierarchy: Click on it to set parent/child relationships between charts in the dashboard. Selecting a data item from the parent chart filters the child chart by the selection. To set a hierarchy, drag the chart to be set as a child under the chart to be set as a parent. Once you finish setting the chart hierarchy, the chart menu is restructured accordingly.
Text widget panel¶
On the text widget panel, you can add/edit/delete a text widget in the dashboard.
- Number of text widgets: Displays how many text widgets are registered in the dashboard.
- Add a text widget: Click on it to create a new text widget in the dashboard.
- Text widget list: Lists text widgets registered in the dashboard. Hover the mouse over a widget to display the edit and delete icons. Drag a widget to the widget display area to display the widget in the display area.
Layout panel¶
On the layout panel, you can adjust some settings on how to arrange widgets and display each widget in the widget display area.
Set board height
- Fix to screen: Maximizes the height of the dashboard to fill the screen.
- Fix to height: Set the height of the dashboard to a specific pixel value.
- Margin between widgets: Sets the margin between widgets in the widget display area.
Chart title: Sets whether to display the title of each chart and filter widget in the widget display area.
Legend: Sets whether to display a legend for each chart widget in the widget display area.
Mini-map: Sets whether to display a mini-map for each chart widget in the widget display area.
Data source panel¶
In the data source panel, you can view and edit information on connected data sources, as well as add column filters easily. Click on a filter icon on a dimension or measure on the right-hand side to add a filter.
Please note that the filters you can apply or clear here are global filters applied to the entire dashboard, and those applied or cleared in the chart editor are all chart filters.
Check data sources in a dashboard¶
Click the button on the basic dashboard page to open a dialog box showing information about the data sources used in the dashboard. At the top-left corner, you can choose the data source that you want to view. This dialog box consists of three tabs (Data grid, Column detail, Dashboard data information).
Presentation with a dashboard¶
Click Presentation view on the basic dashboard page to view workbook dashboards with a presentation UI. In this mode, you can easily report and share data analytics results.
![]()
- Name: Name of the current dashboard.
- Slide navigation: Each circle represents a different dashboard in the workbook. For example, if you click the 4th circle, the 4th dashboard slide will be displayed with that circle highlighted.
- Auto slide show settings: Select a duration for each slide and click PLAY to start an auto slide show.
- Exit: Closes the presentation view and returns to the workbook/dashboard basic page.
Renaming columns¶
Hover the mouse over a column name on the data source panel in dashboard editing mode, and click the icon on the right to check the alias of the column.
Hover the mouse over the alias to open a window where you can enter a new column name. After entering the name, click Apply to see the change applied.
Chart¶
Charts that analyze and visualize data are the main components of a dashboard. This section describes some concepts that you need to understand to create a chart for data analytics, as well as the elements that make up the chart configuration UI.
The chart home is divided into the following three sections:
![]()
- Column/chart selection section: This section is so organized that you can create a chart step by step. You can either choose columns under the Data menu to have appropriate chart types suggested, or select a chart type under the Chart menu before choosing data columns. In addition, you can configure some analytics settings under the Analytics menu.
- Visualization section: This section is composed of the shelves onto which columns are put and the visualization area where the chart is displayed. Once data and a chart type are selected in the column/chart selection section, the chart is drawn in this area.
- Option section: Used to customize the appearance and display of the chart. Depending on the chart type, the option section may include the filter, palette, axis, numeric format, and chart format areas.
In the subsequent subsections, we will explain how to use this user interface to create and manage various types of charts.
Data column list¶
The columns listed in the data column list are categorized into “dimensions” and “measures.” For the concept of dimensions and measures, refer to “Dimensions” and “Measures”.
Structure of the data column list¶
In the data column list, you can view and edit information on connected data sources, as well as add or remove column filters easily.
![]()
Select/set data source: Allows you to select a data source or configure its associations and joins.
Data details: Click on it to pop up a dialog box displaying information about the selected data source.
Search by column name: Searches the column list by name.
Add custom column: Click on it to open the dialog box to create a new column by combining/processing data source columns. Custom columns are commonly used throughout the dashboard.
Apply/clear filter: Hover the mouse over a column to display this button. Click on it to apply a chart filter to the column, and click again to clear the chart filter. For columns to which a filter is applied, the filter icon is displayed regardless of the mouse position.
More: Hover the mouse over a column to display this button. It is used to check additional information on the column and set an alias.
- Column details: Click on it to pop up a dialog box displaying a summary of the column and its data values.
- Logic column name: Shows the logical name of the column.
- Type: Shows the logical type of the column.
- Alias: Sets a column alias. A regular column name can contain only alphanumeric characters and a limited number of special characters with no spaces allowed. Therefore, setting an alias may help to identify the column for convenient analytics work. Aliases are commonly used throughout the dashboard.
- Value alias: You can also set an alias for each data value in the column. Aliases are commonly used throughout the dashboard.
Add a custom column¶
Click the + button on the data source column list to open a dialog box for adding a custom column. By applying various formulas to existing columns of the data source, you can create a new column that helps create your desired chart.
![]()
- Column name: Fill in a name for the custom column.
- Coding box: Write the code for the custom column. Clicking an item in the column or formula lists below inserts it into this box automatically.
- Add column: Lists the columns of the data source. Click a column in the list to insert it into the coding box.
- Add formula: Lists the formulas supported by Metatron. Click a formula in the list to insert it into the coding box, with the text cursor placed where a parameter needs to be entered. For details on each formula’s purpose, usage, and examples, see the help box on the right.
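Metatron has its own formula list and expression syntax, documented in the dialog’s help box. As a rough analogy only, the sketch below computes the same kind of derived column (a profit ratio built from the Profit and Sales columns of the tutorial data) in pandas with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Sales": [250.0, 120.0, 310.0],
                   "Profit": [50.0, 18.0, 93.0]})

# Derived column: profit as a percentage of sales.
df["ProfitRatio"] = df["Profit"] / df["Sales"] * 100
print(df)
```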
Draw a chart (pivoting)¶
What is pivoting¶
Pivoting is a process of grouping the given table by specific columns, thereby helping the analyst view particular aspects of the source data in a graphic or tabular chart. This process includes selecting columns that contain meaningful data and placing them on the column/row/cross shelves.
In the example shown above, two dimension columns are placed on the column shelf and one measure column is placed on the cross shelf. The chart displays data resulting from the columns placed on the shelves in this way.
Mandatory/recommended column types for each shelf vary depending on the chart type. Selecting a chart type before placing columns on a shelf shows the necessary column types for each shelf.
Column/row/cross shelves¶
Think of the structure of an Excel spreadsheet to understand what the column/row/cross shelves are for. As shown below, the crossing of each column and row contains a value.
Whereas Excel shows data in a two-dimensional grid composed of columns, rows and crosses, Metatron is an OLAP data discovery tool capable of multidimensional data representation. In the following Metatron chart, the column, row, and crossing axes form a three-dimensional cube.
If the values of an Excel grid are displayed in a three-dimensional chart, each crossing value will be represented by a bar. However, Metatron needs to display such a chart two-dimensionally; for this, bars either in the same column or in the same row get stacked at one point while remaining distinctive from one another. The resulting two-dimensional chart is shown in the gray area of the chart below.
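If you are more used to working in code, pivoting corresponds closely to a pivot table: the dimensions placed on the column/row shelves become grouping keys, and the measure on the cross shelf is aggregated in each crossing. A pandas analogy with hypothetical Region and Category dimensions and a Sales measure:

```python
import pandas as pd

df = pd.DataFrame({
    "Region":   ["East", "East", "West", "West", "West"],
    "Category": ["Chairs", "Tables", "Chairs", "Tables", "Chairs"],
    "Sales":    [120, 80, 200, 150, 90],
})

# Rows play the role of one shelf, columns the other,
# and each cell holds the aggregated measure (the cross shelf).
pivot = df.pivot_table(index="Region", columns="Category",
                       values="Sales", aggfunc="sum")
print(pivot)
```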
Select a chart type¶
Metatron Discovery provides about 20 types of charts. If you place columns on shelves before selecting a chart, suitable charts are highlighted in purple.
The table below summarizes conditions to create, uses, and examples for each chart.
| Chart name | Conditions to create | Characteristics | Uses | Examples |
|---|---|---|---|---|
| Bar chart | Column: 1 or more dimensions / Cross: 1 or more measures | Compares the value of each item. | Used to compare groups or view trends over time. Very effective when the trend is significantly fluctuating. | Comparison between products regarding their sales and profits |
| Table | Column or row: 1 or more dimensions / Cross: 1 or more measures | Displays the values of crossings between two dimensions as text. | Used to view measure values aggregated by certain criteria. Useful to check exact values rather than a visualization of them. | Sales details by year |
| Line chart | Column: 1 or more dimensions / Cross: 1 or more measures | Displays data changes over time. | Used to view trends over time. If changes are moderate, a line chart is more effective than a bar chart. | Monthly sales trend |
| Scatter chart | Column: 1 measure / Row: 1 measure / Cross: 1 or more dimensions | Displays relations between items. | Used to define relations between two parameters. | Relations between product sales and profits |
| Heatmap | Column or row: 1 or more dimensions / Cross: 1 or more measures | Displays the values of crossings between two dimensions in colors and sizes at different points. | Used to provide an intuitive view of relations between two dimensions represented by colors and sizes. Similar to a table chart, but more of a visual type. | Sales of each product by region |
| Pie chart | Cross: 1 or more dimensions, 1 or more measures | Shows how much each item accounts for. | Used to compare the compositions of something. | Comparison between web browsers regarding their market share |
| Map view | Layer shelf: dimension (location attribute), 1 or more dimensions, 1 or more measures | Displays the data for each location on the map. | Used for intuitive comparisons of variables by using colors for each region. Used to emphasize visual elements. | Comparison of sales of each product by region |
| KPI | Cross: 1 or more measures | Displays main indicators along with their trends. | Used to quickly convey information on an organization’s current achievement. | An organization’s performance index, such as how many customers have been brought in this year |
| Box plot | Column: 1 or more dimensions / Row: 1 dimension / Cross: 1 measure | Indicates increase and decrease in value. | Used to compare groups regarding their share. | Proportion of flight delay accounted for by each airplane model |
| Waterfall chart | Column: 1 time-dimension / Cross: 1 measure | Displays cumulative changes resulting from the increase or decrease in value for each time interval. | Used to emphasize increase and decrease in value over time. | Changes in the number of team members for a certain period; stock price trends |
| Word cloud | Cross: 1 or more dimensions, 1 measure | Displays words sized in proportion to the number of mentions. | Used to summarize and emphasize important words. | Summary of the voices of customers |
| Combo chart | Column: 1 or more dimensions / Cross: 2-4 measures | Compares data by combining bar and line charts. | Used to visualize different types of data simultaneously. | Simultaneous monitoring of product price and sales |
| Treemap | Column: 1 dimension / Row: 1 or more dimensions / Cross: 1 measure | Displays hierarchical data using nested rectangles. | Used to visualize hierarchical data. | Monitoring of sales of products classified into major, medium, and minor categories |
| Radar chart | Cross: 1 dimension, 1 or more measures | Displays different quantitative variables on axes starting from the same point. | Used for a visual comparison among different quantitative variables. | Product quality evaluation in five aspects |
| Network diagram | Subject shelf: 1 dimension / Target shelf: 1 dimension / Connecting shelf: 1 measure | Diagram connecting elements in dependence relations. | Used to view data flows regarding where data elements are generated. | Monitoring the task flows of a project |
| Gauge chart | Column or row: 1 or more dimensions / Cross: 1 measure | Visualizes performance for the specified target. | Used to view the proportions of data elements. | Monitoring of profits by region |
| Sankey diagram | Column: 3 or more dimensions / Cross: 1 measure | Displays the proportion of each data flow by the width of the connection line. | Used to monitor data flows and their respective sizes. | Monitoring energy flows in a factory |
Chart style attributes¶
Once data is pivoted, an options menu is shown on the right of the screen to allow you to set the chart style. The composition of the menu varies with chart type. This section describes the settings used universally by all chart types and the “Common Setting” items for each chart type.
Chart style settings menu¶
This section describes how to configure the settings of the chart style settings menu. Note that not all the settings are shown for every chart type.
Color setting
Defines various colors used in the chart.
Graph color setting: Set criteria to classify data on the chart by color, and select a coloring theme.
- Series: Colors data elements differently for each measure.
- Dimension: Colors data elements differently for each dimension value.
- Measure: Colors data elements according to the size of each aggregated measure value.
Setting color range: This setting is displayed when Measure is selected as the criterion to classify data by color. Turn it on to color each range of measure values differently. The measure data to be colored can be subdivided into as many ranges as you want, starting with the lowest one. To add a new range, adjust the upper limit of the highest range and click Add new range.
Number format
Defines how to display numerical text data on the chart graph. To use this function, turn on Show Axis Label in the Data Label Settings Menu.
- Format: Select a display format for numeric values from among number, currency, percent, and exponent.
- Decimal place: Set how many digits to display after the decimal point.
- Number abbreviations: You can use K (thousands), M (millions), or B (billions) as an abbreviation for a large numeric value. Select Automation to automatically apply the most appropriate symbol based on the number of digits.
- Thousands separator: Select whether to add thousands separators when displaying numeric data values.
- Custom symbol: Insert custom text before/after numeric data values.
- Preview: Displays the result of the defined number format.
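To make the abbreviation and decimal-place options concrete, the sketch below shows the kind of formatting they apply; it is illustrative only and not Metatron’s own implementation:

```python
def abbreviate(value, decimals=1):
    """Format a number with a K/M/B suffix and a fixed number of decimals."""
    for threshold, suffix in [(1e9, "B"), (1e6, "M"), (1e3, "K")]:
        if abs(value) >= threshold:
            return f"{value / threshold:.{decimals}f}{suffix}"
    return f"{value:,.{decimals}f}"

print(abbreviate(1234567))    # 1.2M
print(abbreviate(9876, 2))    # 9.88K
print(abbreviate(512))        # 512.0
```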
Y-axis setting (when chart type is vertical)
If you set the chart direction Horizontal in the Common Setting area, the settings are exchanged between X-axis and Y-axis.
Show axis title: Used to set a title for the Y-axis of the chart. Disabling this function hides the title of the Y-axis.
Show axis label: Select whether or not to show the data labels on the Y-axis of the chart. Disabling this function hides the data labels on the Y-axis.
- Label setting: Sets the numeric format of the data labels on the Y-axis. Choose automatic to inherit the Number format settings, or manual to specify a format for the data labels on the Y-axis.
X-axis setting (when chart type is vertical)
Defines how to display the X-axis of the chart. If you set the chart direction Horizontal in the Common Setting area, the settings are exchanged between X-axis and Y-axis.
Show axis title: Used to set a title for the X-axis of the chart. Disabling this function hides the title of the X-axis.
Show axis label: Select whether or not to show the data labels on the X-axis of the chart. Disabling this function hides the data labels on the X-axis.
- Rotation: Select an angle for the data labels on the X-axis from among 0, 45, and 90 degrees.
Data label setting
Selects whether to display the data values on the chart graph.
Common settings for each chart type¶
This section describes how to style the six most popular chart types (bar chart, table, line chart, scatter chart, heatmap, and pie chart).
Bar chart
This type of chart presents data values in each category of a dimension column with rectangular bars.
![]()
Chart type
- Vertical: Displays data values as vertical bars with the dimension axis set vertical.
- Horizontal: Displays data values as horizontal bars with the dimension axis set horizontal.
- Parallel: If more than one measure is selected, different bars representing those measures are displayed in parallel.
- Stacked: If more than one measure is selected, different bars representing those measures are stacked at one position.
Limitation: Set how many columns to display on the chart.
Table
A table block is formed based on the categories into which the dimension columns on the column/row shelves are grouped; accordingly, the values of the measure columns on the cross shelf are displayed as text in the crossings.
![]()
Chart type
- Pivot: Aggregates measure values (`SUM`, `MIN`, `MAX`, etc.) for each pair of column and row dimensions into a separate cell.
- Original: Displays all original measure values unaggregated, together with the selected dimensions.
- Vertical: Displays measure values vertically in the table. This cannot be used when “Original” is selected for displaying the table.
- Horizontal: Displays measure values horizontally in the table. This can be used only when “Pivot” is selected for displaying the table.
Show head column: Set horizontal and vertical text alignment in the column headers. When “Original” is selected, the column headers are always shown. When “Pivot” is selected, you may optionally hide the column headers.
Line chart
This type of chart presents data values in each category of a dimension column with points. Adjacent data points are connected with each other. This type of chart is used to view trends.
![]()
Chart type
- Line type: Displays the chart graph by drawing lines between points that represent measure value aggregates.
- Area type: Colors the area formed by the connecting lines.
- Line & point: Shows both the data points and connecting lines.
- Point: Shows the data points only.
- Line: Shows the connecting lines only.
- Basic: Displays each aggregate as it is on the chart.
- Cumulative: Displays cumulative aggregates on the chart.
Scatter chart
This type of chart presents data values in each category of a dimension column with defined symbols.
![]()
- Symbol type: Set the shape of the symbol to be shown on the chart.
- Symbol transparency: Set the transparency of the symbol to be shown on the chart. You can set colors either solid or transparent.
Heatmap
This type of chart displays values aggregated from the measure column placed on the cross shelf by using colors. For a larger aggregated value, a darker color is applied. The heatmap type does not provide any common settings.
Pie chart
This type of chart visualizes the proportion of each category of the dimension column.
![]()
Chart type
- Sector: Displays a pie-shaped chart.
- Donut: Displays a donut-shaped chart.
Filter¶
Filters display only the data that matches their preset conditions when dashboards and charts are formed. Charts use two types of filters: chart filters and global filters. Chart filters are applied to individual charts, whereas global filters are applied to an entire dashboard.
Chart filters¶
A chart filter defines what range of data is to be shown on the chart. This chapter describes how to set up and make use of chart filters.
Automatically included filters¶
The following column filters are included automatically when a chart is created:
- Timestamp column filter: As a time-series data store, the Metatron engine necessarily uses a time filter.
- Recommended filters: Column filters designated as “recommended filters” during the registration of the data source.
- Dashboard filters set global: Filters applied to all charts registered in the dashboard.
Chart filter panel¶
The chart filter panel is located on the right-hand side of the chart home screen. On this panel, you can easily view and configure registered filters.
![]()
- Filter number: Displays how many filters are registered for the chart.
- Add/edit filter: Click on “+” at the top right to either add a new filter or open a popup for configuring an existing filter.
- Columns applied with the filter: The top part of each individual filter displays which columns the filter is applied to.
- Filter settings: Click the hamburger menu at the top right of an individual filter either to reset the filter or configure the details of the filter.
Chart filter dialog box¶
Click the button at the top of the chart filter panel or click the button in each filter area to open the chart filter dialog box. With this dialog box, you can add a new filter or configure an existing filter.
The chart filter dialog box is divided into the Dimension and Measure tabs as shown below:
Dimension filtering¶
From the connected data source, select a dimension on which to create a filter.
![]()
Value range: Select whether to filter the chart by a single or multiple data categories.
- Single: Select one data category by which to filter the chart.
- Multiple: Select multiple data categories by which to filter the chart.
Search: If there are too many elements in the column, this function allows you to limit the results only to those you wish to see.
- Search by name: Search the column element list by name.
- Element filtering: Filters elements either by matching element names with regular expressions or wildcards, or by applying a range condition to a measure.
![]()
- Defined value: Used to add, as a filter criterion, a data element that is not contained in the column. This allows you to create a filter in advance for a data element that may be added later.
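Wildcard and regular-expression matching behave here as they do in most tools. The sketch below, with made-up category names, shows the difference between the two matching styles:

```python
import fnmatch
import re

categories = ["Office Chairs", "Office Tables", "Home Chairs", "Bookcases"]

# Wildcard style: * matches any sequence of characters.
print(fnmatch.filter(categories, "Office*"))

# Regular-expression style: anchor the pattern to the end of the name.
print([c for c in categories if re.search(r"Chairs$", c)])
```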
Timestamp column filter settings¶
Dimensions with a time icon displayed are of a timestamp type for which a timestamp filter can be configured. Although they are set to “All time” by default, you can select Relative or Specific if you wish to display only data from a certain period in the chart.
“Relative” sets a period of time relative to the present and displays only data from the applicable period of time in the chart.
“Specific” directly sets a certain period of time of data and displays only data from the applicable period of time in the chart.
Measure filtering¶
From the connected data source, select a measure on which to create a filter.
Once you have selected a measure, designate the range of values to filter.
Global filters¶
Global filters specify which data is to be displayed in all charts of a dashboard. They can be added, edited, or deleted in the filter panel in the dashboard editing window.
![]()
- Number of filter widgets: Displays how many filter widgets are currently registered in the dashboard next to the global filter heading.
- Add a filter widget: Click the “+” icon at the top right to create a new filter widget in the dashboard. The filter creation popup interface and process for creating filters are the same as the process for creating chart filters described in the previous section.
- Filter widget list: Lists filter widgets registered in the dashboard. Hover the mouse over a widget to display the edit and delete icons. Drag a widget to the widget display area to display the widget in the display area.
Global filters applied to the entire dashboard are also listed when creating an individual filter for a new chart. When creating a global filter, if individual chart filters already exist, you are shown which column each of those filters was created from.
Notebook¶

Metatron Discovery supports a notebook function. Notebook is a tool for creating and sharing documents that include live codes, equations, visualizations, and descriptive texts. It is mostly used for data cleaning and manipulation, numerical simulations, statistical modeling, and machine learning.
Metatron Discovery allows users to register and use external Jupyter and Zeppelin servers. Jupyter uses Python and R, programming languages commonly used in data science, while Zeppelin uses Spark (Scala) to help with real-time and interactive analysis and visualization of data. Before running the notebook, its server must be set up.
Register a notebook server¶
To analyze data in a workspace using a notebook, initial settings are required for the notebook server. The procedure for initial settings for a notebook server is as follows:
Click the More button in the top-right corner of the workspace and select Set notebook server.
![]()
From the list of Jupyter and Zeppelin servers preregistered by the administrator, select the notebook server that you wish to connect to and use in your workspace and click Done.
![]()
Create a notebook¶
Once the notebook server has been set up, you can create a notebook. A notebook is created as follows:
Click the + Notebook button at the bottom of the workspace. You’ll be prompted to create a notebook.
![]()
Select the type of data set that you wish to analyze in the notebook. You can choose among Data source (the unit of data used in Metatron Discovery), Dashboard, Chart, and Not selected. If you want to use Zeppelin, select Not selected.
![]()
After selecting either Data Source, Dashboard, or Chart, you can see a list of data currently registered in Metatron Discovery. Select the data to analyze and click Next.
![]()
Enter the information about the notebook that you want to use as an analytics tool for data. The server type can only be selected for a notebook server connected at the initial notebook server setup. If Jupyter is selected, “R” or “Python” can be selected for analysis, whereas “Spark” (Scala) is used when Zeppelin is selected.
![]()
Once a notebook has been created, you can find it in the workspace.
![]()
Use a notebook¶
In a newly created notebook, you can write a script and serve it through a REST API. A notebook can be used as follows:
Detailed notebook page¶
On the workspace screen, select the notebook you want to use as an analytics tool. The following screen with detailed information then appears, showing basic information on the notebook: data type, data source name, development language, analytic code, and so on.
Notebook coding¶
Click Detail on the notebook page to pop up a new window for coding in the notebook. At the top of this window, a code to load a dataset is inserted; executing this cell loads a JSON dataset as the dataset object.
The screen above appears when Zeppelin is selected and includes a cell for loading the data selected when the notebook was created. After coding the program starting from the third cell, click Save when you are finished.
Register a notebook API¶
Once you write a notebook code, you can return the results by calling a REST API. Select a Return type by referring to the descriptions below, and enter a Name and Description.
![]()
HTML: The results of running the notebook script are returned in HTML.
JSON: The results of running the notebook script are returned in a custom JSON format. In this case, the `response.write(...)` function provided by Metatron Discovery is used. The following are example calls of the response.write function:
- R-based notebook: `response.write(list(coefficient = 2, intercept = 0))`
- Python-based notebook: `response.write({'coefficient' : 2.5, 'intercept' : 0})`
None: Runs the notebook script but does not provide returns.
Once you enter API information and click Done, the API is created to provide a REST API URL as shown below. Click Result to view the URL execution results in a popup window.
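Once the API has been created, any HTTP client can call it. A minimal sketch with the requests library follows; the URL is a placeholder for the REST API URL shown on the notebook page, and whether a GET or POST request applies may depend on how the API was registered:

```python
import requests

# Placeholder: copy the REST API URL shown on the notebook page.
url = "http://discovery.example.com/api/your-notebook-api"

resp = requests.get(url, timeout=30)
resp.raise_for_status()

# HTML return type -> resp.text; JSON return type -> resp.json()
content_type = resp.headers.get("Content-Type", "")
print(resp.json() if "json" in content_type else resp.text)
```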
Workbench¶

Metatron Workbench provides an environment for data preparation and analytics based on SQL. Its main functions are as follows:
Various external databases can be loaded in one space.
The user can conveniently navigate/select linked tables and columns and view their details.
Query editing tools are embedded, and query results can be viewed interactively and used in various ways:
- Query results can be downloaded into a local file or exported to an online Excel.
- Query results can be interactively visualized to help the analyst see an outline of the resulting data table.
- Query results can be stored as a data source available for analytics in a workbook or notebook.
Each document that stores SQL-based analytic queries is called a “workbench.” This chapter introduces how to create and use workbenches.
Create a workbench¶
To use a workbench in the workspace, a workbench-type data connection must be established. See Data Connection for how to handle it.
To create a workbench:
Click the + Workbench button at the bottom of the workspace. You’ll be prompted to select a data connection for data analytics.
Select the workbench-type data connection that connects to the data table you want, and click Next.
- Search by name of data connection: Searches the list of data connections available to the workspace by the name you type in.
- DB type: Filters data connections by database type (Oracle/MySQL/Hive/Presto/Tibero). Select All to display data connections regardless of database type.
- Account type: Filters data connections by account type (All/Always connect/Connect by user’s account/Connect with ID and password). Select All to display data connections regardless of account type.
- Data connection: Lists data connections filtered by specified criteria.
Confirm the information of the selected data connection and enter a name and a description to create a workbench.
The created workbench is immediately available.
Use a workbench¶
In the workbench, you can edit and manage an SQL database easily, as well as visualize and store the results of a query on it in various forms. The workbench page consists of five sections shown below, and an additional schema browser is provided.
![]()
- Basic information section (See Basic information section)
- Schema and table section (See Schema and table section)
- Query editor section (See Query editor section)
- Query results section (See Query results section)
- Extra tools section (Extra tools section)
- Schema browser (Schema browser)
Basic information section¶
This section displays basic information on the active workbench.
![]()
- Name: Name of the workbench. Click on it to change the workbench’s name.
- Data connection: Name of the data connection used by the workbench. Click the information icon to view its details.
- Collapse/expand: UI button to collapse or expand the panel.
Schema and table section¶
This section provides a UI to conveniently insert the name of a database, table, or column in the query editor.
![]()
- Database name: Displays the name of the selected database. By default, the first database of the data connection used by the workbench is selected. Click on it to list all databases included in the data connection. Select a database in the list to replace the currently selected database.
- Schema browser: A popup browser displaying the table list of the selected database, and information of all the columns and records in each table.
- Search table: Searches the list of the tables registered in the selected database by the name you type in.
- Table name: Select a table to automatically insert it into the query editor as part of a `SELECT * FROM {table name}` query.
- Column list: Displays all columns belonging to the table and their respective data types. Click a column name to automatically insert it into the query editor.
Query editor section¶
This section allows you to edit and run queries.
![]()
- Tab navigation: Navigates to previous or subsequent query tabs when there are too many tabs to display; if there are only a few tabs, this button does not appear.
- Tab: You can run or store queries in separate tabs for more efficient management of them. Click the button on a tab to edit the tab title or delete the tab.
- Add tab: Click this button to add a new tab.
- Resize: Click this button to minimize the query editor or maximize it to full screen.
- Query lines: Displays the numbering of the query code lines.
- Editor area: Write query statements in this area. You can run either single or multiple queries. Insert `;` at the end of each query statement to run them separately (see the example after this list). Autocomplete is supported.
- Execute full: Executes all queries in the editor. (Shortcut: Ctrl + Enter)
- Execute partial: Executes only the query statement where the cursor is located, or executes queries selected by dragging the mouse. (Shortcut: Command + Enter)
- CLEAR SQL: Clears all query statements.
- SQL BEAUTIFIER: Reformats query statements using standard query syntax.
- Query History: Lists past queries executed in the query editor. If you select a query in the list, it will be inserted in the query editor.
- Query Editor Shortcuts: Shows a list of shortcuts available in the query editor.
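As a concrete illustration, the sketch below shows two statements separated by `;`; the table and column names (`sales_sample`, `State`, `Sales`) are placeholders for this example, not part of the product.

```sql
-- Two statements in one editor tab (table/column names are illustrative).
-- Execute full runs both in sequence; Execute partial runs only the
-- statement under the cursor or the dragged selection.
SELECT COUNT(*) FROM sales_sample;

SELECT State, SUM(Sales) AS total_sales
FROM sales_sample
GROUP BY State
ORDER BY total_sales DESC;
```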
Query results section¶
Once a query is executed, its results are displayed in a query results tab. Query results tabs are cumulatively added, and you can selectively delete specific results tabs. Query results are displayed in a text grid, and they can be previewed in charts, stored into data sources, and exported into CSV files.
During query execution¶
![]()
Query result tabs: When multiple queries are executed, a different tab is created for each query to show its result. While a query’s execution is in progress, “Loading” is displayed in its tab title.
Query log: Shows an execution log for the query. In the case of a Hive connection, a Hive job log is additionally displayed.
Cancel: Cancels the execution of the query. The time taken for cancelation may vary with the DB type.
Query execution phase: Shows the current phase of query execution. There are a total of five query execution phases.
- Getting connection
- Creating statement
- Executing query
- Getting result set
- Done!
No. of the current query: Shows the number of the currently executed query when multiple queries are executed.
Output/Result tabs: By clicking either tab, you can switch to the query log/result view.
Query status: Shows the query’s status from among:
- Running query
- Query execution failed..
- Query execution canceled..
Query start time: Displays when the query execution started.
Query running time: Displays how long it took to execute the query.
After query execution¶
![]()
- Query result tabs: When multiple queries are executed, a different tab is created for each query to show its result. While a query’s execution is in progress, “Loading” is displayed in its tab title.
- Data details: Shows a data table resulting from executing the query. You can copy this data output to the clipboard.
- Output/Result tabs: By clicking either tab, you can switch to the query log/result view.
- Search for column data: Searches for a column or value in the resulting table.
- Chart preview: Draws a virtual chart of the query results. This chart is only for visualization; it is not stored in the workspace. (See Chart for how to handle it)
- Save as Data source: Stores the query results into a data source in the workspace. A dialog box will pop up to create a data source, and the resulting table is used instead of selecting a data connection and a table. Therefore, you will be immediately prompted to set the schema definition and ingestion cycle. (See Create a data source for how to handle it)
- Export CSV file: Downloads the resulting table into a local file (CSV).
- Data page navigation: If the resulting data includes more than 1,000 rows, you can navigate the data pages using the Prev and Next buttons.
- Query start time: Displays when the query execution started.
- Query finish time: Displays when the query execution finished.
- Query running time: Displays how long it took to execute the query.
- Query data rows: Shows the number of rows of the resulting data and the current page number.
Extra tools section¶
The extra tools section provides useful tools for the workbench.
- Setting up global variables for repeatedly used statements (See Setting up global variables)
- Navigation to move to another workbench (See Workbench Navigation)
Setting up global variables¶
If a certain type of statement is repeatedly used with a different value for each query run, set the variable element as a “global variable” for convenient use.
- Variable type: You can select either a calendar or text type.
- Add new variable: Select the variable type you want and click “Add new variable.” A new global variable will be added in the query editor.
- Name: Enter a name for the variable.
- Variable value: For a calendar variable, select a date; for a text variable, select a text value.
Schema browser¶
Displays the table list of the selected database, and information of the columns and records in each table.
![]()
- Column: Shows the names and data types of all columns of the selected table.
- Information: Displays attributes of the selected table.
- Data: Displays data of the selected table. A maximum of 50 rows can be viewed.
Data Preparation¶
Data Preparation is a tool that creates transformation rules to transform files and tables for more convenient analysis of datasets, and saves the results into HDFS or Hive.
Advantages of data preparation in Metatron Discovery
Users can create transformation rules by following the step-by-step process as shown in the above GUI. Since the transformation results from each step are stored in memory together with the data distribution, users can easily check the results through the simple click of a button and perform undo and redo just like using a text editor.
Based on these characteristics, the data preparation tool offers the following advantages:
- Users unfamiliar with programming or data processing can obtain the desired results.
- Adding a transformation rule usually involves programming or writing an SQL query. However, Metatron Discovery’s Data Preparation provides a GUI for exploratory transformation that enables the creation of transformation rules simply by clicking a button or typing.
- Basic data transformation is conducted automatically. For instance, a type cast is automatically applied to columns comprised of numerals. This is safe because automatically added rules can be reverted using the undo and rule deletion functions.
- Data of different forms can be combined as desired (e.g. reference file + fact table).
- The results of data refinement can be shared with others, thus reducing the burden of exchanging physical data.
- Storage space is saved and the information life cycle (ILM) is shortened by deleting the actual data and retaining only the transformation rules involved. The actual data can easily be recreated whenever needed.
Structure of data preparation in Metatron Discovery
As shown in the above figure, data preparation is comprised of a dataset built from the target data, a dataflow that defines transformation rules for the designated dataset, and a data snapshot that shows the transformation results.
Create a dataset¶
A dataset, which is the basic unit of data preparation, refers to an entity subject to data operations. Datasets are either imported datasets or wrangled datasets.
- Imported Dataset: A source data entity before the implementation of transformation rules
- Wrangled dataset: A data entity subject to analysis following the implementation of transformation rules
A wrangled dataset is created during the dataflow setting process, which defines transformation rules, while an imported dataset is created during this dataset creation procedure.
The Dataset menu can be accessed under MANAGEMENT > Data Preparation > Dataset on the left-hand panel of the main screen.
Next, on the upper right of the dataset page, click the + Generate new dataset button to create a new dataset.
In the dataset creation page, select the dataset type.
- My file: Create a dataset by opening the user’s local file or via a URI (upcoming feature) (See Create a dataset from a file for a detailed procedure).
- Database: Create a dataset using external database access information and queries (See Create a dataset from a database for a detailed procedure).
- Staging DB: Create a dataset from the staging DB built in Metatron (See Create a dataset from staging DB for a detailed procedure).
Note
The Staging DB is an in-cluster database that stores data temporarily in order to facilitate data loading. Hive is generally used for it.
Create a dataset from a file¶
Create a dataset by opening the user’s local file or via a URI (upcoming feature).
On the data type selection page, select My File.
Select a file to be used as a data source from your local PC. You can click the Import button to select a file, or drag and drop the file into the box. Once a file is selected, click Next.
![]()
Check the grid of the uploaded file, and designate a column delimiter. Proceed if the data is successfully displayed.
![]()
Enter the Name and Description of the dataset, and click the Done button.
![]()
Once the dataset is created, the dataset list is displayed. You can check that the list contains the newly created dataset.
![]()
Create a dataset from a database¶
Create a dataset using external database access information and queries.
To create a dataset from a database, you should first create a data connection. See Create a data connection for a detailed procedure.
After establishing the data connection, go to MANAGEMENT > Data Preparation > Dataset > + Generate new dataset.
On the data type selection page, select Database.
Select the data connection, and press the Test button to check that the connection is valid.
![]()
Select the data. You can either select a table from the connected database, or write a query yourself.
![]()
- Table: Select a database and a table to display the table’s data. Once the data being ingested has been displayed, confirm the data and click Next.
- Query: Write a query to import the data you want, and click Run to display the data in the lower section. Confirm the data and click Next.
Enter the Name and Description of the dataset, and click the Done button.
![]()
Once the dataset is created, the dataset list is displayed. You can check that the list contains the newly created dataset.
![]()
Create a dataset from staging DB¶
Create a dataset from the staging DB built in Metatron.
The creation of a staging DB dataset is the same as dataset creation from a database, but does not involve the selection of a data connection.
On the data type selection page, select Staging DB.
Select the data. You can either select a table from the connected database, or write a query yourself.
![]()
- Table: Select a database and a table to display the table’s data. Once the data being ingested has been displayed, confirm the data and click Next.
- Query: Write a query to import the data you want, and click Run to display the data in the lower section. Confirm the data and click Next.
Enter the Name and Description of the dataset, and click the Done button.
![]()
Once the dataset is created, the dataset list is displayed. You can check that the list contains the newly created dataset.
![]()
Manage a dataflow¶
A dataflow is the unit of processing for datasets. A single dataflow can be associated with multiple datasets to perform transformations; that is, a dataset must belong to a dataflow for transformation rules to be applied. Within a dataflow, a dataset can also form relationships such as a “join” or “union” with other datasets.
As shown below, the dataflow details page shows the dependency among all datasets in a dataflow, and the transformation rules applied to each dataset.
The following subsections cover the processes involved in defining a dataflow, such as adding a dataset, editing transformation rules, and creating a data snapshot with transformation results.
The Dataflow menu can be accessed under MANAGEMENT > Data Preparation > Dataflow on the left-hand panel of the main screen.
Add a dataset¶
The first step in defining a dataflow is to add a dataset. This can be done using either of the two methods described below:
Adding a dataset after creating an empty dataflow¶
Click Add a dataflow on the upper right of the Dataflow page.
Enter the Name and Description for the dataflow, and click Done to create an empty dataflow.
Click the Add dataset to this dataflow button on the center of the page.
Select the datasets to be added.
When an imported dataset and its corresponding wrangled dataset are created, click the Edit rules button to edit rules (see Edit rules for a detailed procedure).
Creating a dataflow in the dataset details page¶
In the dataset details page, click the Create dataflow with this dataset button to create a dataflow, and proceed until the step before Edit rules.
![]()
Note
The dataflow is named based on the name of the dataset.
Edit rules¶
The key task in data preparation is to create rules for data transformation (usually refinement). These transformation rules, combined with input/output specifications, can then be applied to the actual data or to other similar data, or scheduled as recurring tasks.
Below are instructions on creating rules, checking the results, and modifying or deleting rules.
The Edit Rules page consists of the following:
![]()
- Column type, name, and menu button
- Menu for simple rule creation
- Rule list and insert button (appears when cursor is placed in between rules)
- Enabled when undo and redo are available
- Panel to enter rule details
- Column value distribution, distinct count, type mismatch, null value, etc.
Create a rule¶
Using the column header menu¶
Select a target column by clicking the column header.
- Hold down a modifier key to select multiple columns:
- Depending on your OS, click while holding the ^ or ⌘ key to select/deselect a column (toggle).
- Click while holding the Shift key to select a range.
Click the
icon in the header of a selected column to open the header menu, and select a transformation command.
- Among the commands, drop and settype are performed upon clicking.
To add details, fill out the command input panel below, and click the Add button.
Some commands can be performed by selecting a distribution bar.
- Click a distribution bar to filter the data based on the selected range (toggle).
- Click the type mismatch or null value graph to set conditions for those values.
Using the command input panel¶
Select a transformation rule (command) in the command input panel.
Add details as needed, and click the Add button.
- Target columns can be selected using the input panel. You can also designate a column by clicking the column header.
Inserting into a rule list¶
In the list of rules on the right, place the cursor over the boundary where you wish to insert a new rule. The + Insert rule button appears; press this button.
Select a transformation rule (command) in the command input panel. Add details as needed, and click the Add button.
- When a rule is inserted in this manner, all subsequent rules are affected.
- Rules that cannot be executed normally are displayed in red. In this case, the data reverts to the results obtained in the previous step.
Edit a created rule¶
Editing a rule¶
In the list of rules on the right, place the cursor over the rule to be edited. An edit button appears; press this button.
Edit the rule in the command input panel and press the Done button.
- When a rule is edited in this manner, all subsequent rules are affected.
Deleting a rule¶
In the list of rules on the right, place the cursor over the rule to be deleted. A delete button appears; press this button.
- When a rule is deleted in this manner, all subsequent rules are affected.
![]()
Undo and redo¶
On the upper right of the rule list are icons to perform undo and redo.
To revert to a state before executing a command, press the undo button.
- The dataset reverts to the state before the last transformation (including rule creation, modification, and deletion).
- All rules that were affected also revert to their previous states.
To perform the same command again, press the redo button.
- Pressing redo is faster than following the steps to perform the same command again, because the transformation results are stored in memory.
Rule types¶
This section describes each rule in terms of the following.
- Name of rule
- Required arguments
- Optional arguments
- Description
- Notes
The types of rules supported in data preparation are as follows:
In addition to these rules, data preparation provides various expressions, thereby supporting almost every function required for general data preprocessing.
header¶
Required arguments: Row number that contains the column name (1-based)
Description
- This rule sets the content in the designated row as the column name.
- This is useful for reading a CSV file with column names in the first row.
- Unless otherwise specified, data preparation automatically performs header. This rule may be deleted if header results are not desired, but such cases are not common.
settype¶
Required arguments
- Column: A list of target columns
- New type: Select one out of Long, Double, String, Boolean, and Timestamp
Optional arguments
- Set format: A format string (Joda Time) in the case of timestamp
Description
- This rule changes the type of the selected columns.
- The rule is considered successful even if the result is a type mismatch, which should be separately addressed.
setformat¶
Required arguments
- Column: A list of target columns
- Set format: A Joda-Time format string
Description
- This rule changes the display format of a Timestamp column.
- The target column must be of the Timestamp type.
Notes
As shown below, the format input field lists different entries depending on the input. The candidate list is narrowed as more values are entered.
![]()
rename¶
Required arguments
- Column: A single target column
- New column name: New name
Description
This rule changes the name of the selected column.
To rename two or more columns at once, click the Rename multiple columns button at the bottom of the command input panel to display the following popup.
![]()
keep¶
Required arguments
- Condition: A conditional expression returning a Boolean value
Description
- All rows are deleted except the rows that return true for the conditional expression.
![]()
delete¶
Required arguments
- Condition: A conditional expression returning a Boolean value
Description
- All rows that return true for the conditional expression are deleted. This is the opposite of keep.
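To make the keep/delete distinction concrete, here is a rough SQL analogy; it is an illustration only, not the rule syntax, and the `sales_sample` table and `Sales` column are assumptions.

```sql
-- keep with the condition "Sales > 1000": only matching rows remain
SELECT * FROM sales_sample WHERE Sales > 1000;

-- delete with the same condition: matching rows are removed instead
SELECT * FROM sales_sample WHERE NOT (Sales > 1000);
```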
replace¶

Required arguments
- Column: A list of target columns
- Pattern: A string pattern to be replaced
  - In the case of a constant string: characters enclosed inside `'` (`'Houston'`, `'Naperville'`, `'Philadelphia'`, etc.)
  - In the case of a regular expression: characters enclosed inside `/` (`/[ ,_]+/`, `/\s+$/`, etc.)
- New value: A new string expression to replace the specified pattern
  - Constant string
  - Regular expression (`$1_$2_$3`, etc.)
Optional arguments
- Ignore between characters: Does not make any replacement for content between the characters entered here
- Match all occurrences: Whether all characters of a word must match
- Ignore case: Whether to make the strings case-insensitive
Description
- String replacement is performed for the selected columns.
Notes
- Do not use `'` or `/` in a new value.
- Values from other columns are not available as new values. replace performs string replacement for content in the selected columns only. (cf. set rule)
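For readers more familiar with SQL, the effect of replace on a single column can be sketched with a regular-expression replacement; the `customers` table and `city` column below are assumptions for illustration only.

```sql
-- Roughly what replace does with the pattern /[ ,_]+/ and the new value '_'
-- (Hive/Presto-style SQL, illustrative only):
SELECT regexp_replace(city, '[ ,_]+', '_') AS city
FROM customers;
```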
set¶

Required arguments
- Column: A list of target columns
- Expression: An expression to be applied to the values of the target column. Values from other columns may be referenced. (cf. replace rule)
  - When multiple columns are involved, use a `$col` variable, which will be substituted by the respective target column during each conversion.
  - That is, when applying the set command on `column1` and `column2`, `$col` becomes `column1` during conversion of `column1`, and `column2` during conversion of `column2`.
Optional arguments
- Use only under the following conditions
  - The set rule is applied only to rows satisfying this condition.
  - This may be regarded as equivalent to the `WHERE` clause in SQL.
Description
This rule replaces the values in the selected column with results returned by the expression.
When using a complex expression, click the Advanced editor to display the popup shown below:
![]()
In the Advanced editor, you can edit the expression in a larger window while viewing the column list and a list of functions and their descriptions, and also run a validity check before implementing the expression.
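As a rough SQL analogy (not the rule syntax itself), set with a condition behaves like recomputing a column only for matching rows; the `sales_sample` table and its column values are assumptions for illustration.

```sql
-- set on the Profit column with the expression round($col, 2),
-- applied only where Category = 'Furniture' (illustrative only):
SELECT
  CASE WHEN Category = 'Furniture' THEN round(Profit, 2) ELSE Profit END AS Profit,
  Category
FROM sales_sample;
```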
derive¶
Required arguments
- Expression: An expression whose resulting values are to form a new column. Similar to the set rule, values from other columns may be referenced.
- New column name
Description
- While similar to the set rule, this rule creates a new column instead of replacing an existing one.
Notes
- The new column is inserted after the last existing column in the expression.
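A minimal SQL sketch of the same idea, assuming a hypothetical `sales_sample` table: derive leaves the existing columns untouched and appends a computed one.

```sql
-- derive with the expression "Sales - Profit" and the new column name "Cost" (illustrative):
SELECT *, Sales - Profit AS Cost
FROM sales_sample;
```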
split¶
Required arguments
- Column: A list of target columns
- Pattern: A string expression that serves as a separator that splits the target strings. Allows a regular expression as is the case for the replace rule.
- Number: The number of columns to divide the values into.
Description
- Each value is split into at most the given Number of columns, i.e., the split is applied at up to Number - 1 matches of the pattern.
- When the pattern is no longer matched, the remaining columns contain null.
Notes
- Note that as many columns are created as specified by the Number input.
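A rough Hive-style SQL sketch of splitting one column into several by a delimiter; the `customers` table and `full_name` column are assumptions for illustration.

```sql
-- split on the pattern ' ' with Number = 2 (illustrative Hive-style SQL):
SELECT
  split(full_name, ' ')[0] AS full_name_1,
  split(full_name, ' ')[1] AS full_name_2
FROM customers;
```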
merge¶
Required arguments
- Column: A list of target columns
- Delimiter: A constant string with which values of different columns are concatenated.
- New column name
Description
- The target columns are merged with the Delimiter into a new column.
Notes
- Similar to the replace rule, enclosing the delimiter with `'` may be skipped. That is, strings not enclosed by `/` or `'` are automatically enclosed by `'`.
extract¶
Required arguments
- Column: A list of target columns
- Pattern: A string pattern to be extracted. Allows a regular expression as is the case for the replace rule.
- Number: Number of instances to be extracted
Optional arguments
- Ignore between characters: Does not make any replacement for content between the characters entered here
- Ignore case: Whether to make the strings case-insensitive
Description
- A new column(s) with content matching the given pattern is created.
Notes
- When there are multiple target columns, the resulting columns are inserted after each target column.
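A rough Hive-style SQL sketch of extracting the content matched by a pattern; the table and column are assumptions for illustration.

```sql
-- extract of the first match of the pattern /[0-9]+/ from OrderID (illustrative Hive-style SQL):
SELECT regexp_extract(OrderID, '[0-9]+', 0) AS OrderID_extract1
FROM sales_sample;
```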
countpattern¶
Required arguments
- Column: A list of target columns
- Pattern: A string pattern to be detected. Allows a regular expression as is the case for the replace rule.
Optional arguments
- Ignore between characters: Does not make any replacement for content between the characters entered here
- Ignore case: Whether to make the strings case-insensitive
Description
- New columns are created based on the number of matches with the pattern.
- This is highly similar to extract. The only difference is that it counts the number of matches, rather than extracting the matched content.
Notes
- When there are multiple target columns, the resulting columns are inserted after each target column.
nest¶
Required arguments
- Column: A list of target columns
- Type: Map or Array
- New column name
Description
The target columns are grouped into a new column of the given type.
Below are examples of grouping columns into an array and map, respectively.
![]()
unnest¶

Required arguments
- Column: A single target column
- Select elements: 0-based index for an array, or key value for a map
Description
- A new column is created by extracting the selected elements from an array or a map.
Notes
- The target column must be of the array or map type.
flatten¶
Required arguments
- Column: A single target column
Description
- Rows are created from elements of an array.
Notes
The target column must be of the array type.
![]()
If the target array column has four elements as shown in the above example, each original row results in four rows; the values of the non-array columns are repeated in each resulting row.
![]()
aggregate¶

Required arguments
- Expression: A list of aggregate functions
- Group by: A list of columns to group values by.
Description
New columns are added containing the aggregation results for each combination of values from the Group by columns.
A column is created for each expression. For example, two columns are created if average and count are designated as expressions.
The available aggregate functions are as follows:
- count()
- sum(colname)
- avg(colname)
- min(colname)
- max(colname)
Notes
- Calculations are performed only on the sampled data. Therefore, the snapshot (the results for the entire data) may be different.
- Note that `()` must be inserted when using the count function; count(colname) is currently not available.
![]()
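The same grouping can be sketched in SQL (note that SQL writes COUNT(*) where the rule writes count()); the `sales_sample` table and columns are assumptions for illustration.

```sql
-- aggregate with the expressions count() and avg(Sales), grouped by State (illustrative):
SELECT State,
       COUNT(*)   AS row_count,
       AVG(Sales) AS avg_sales
FROM sales_sample
GROUP BY State;
```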
pivot¶

Required arguments
- Column: A list of columns subject to pivoting
- Expression: A list of expressions whose resulting values form new columns (only aggregate functions are available)
- Group by: A list of columns to group values by.
Description
- Group By is performed for each combination of target columns and GroupBy columns. A dataset having the results as column values is created.
- A set of columns is created for each expression. For example, if average and count are designated as expressions and the values in the pivoted columns are divided into ten groups, a total of 20 columns will be created.
Notes
This is used when performing GroupBy on at least two columns. (1 pivoted column, 1 GroupBy column)
Here, Rename multiple columns is useful as column names tend to get longer.
![]()
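A rough SQL sketch of the same result using conditional aggregation; the table, the Category values, and the resulting column names are assumptions for illustration.

```sql
-- pivot of the Category column with the expression avg(Sales), grouped by State (illustrative):
SELECT State,
       AVG(CASE WHEN Category = 'Furniture'  THEN Sales END) AS avg_sales_Furniture,
       AVG(CASE WHEN Category = 'Technology' THEN Sales END) AS avg_sales_Technology
FROM sales_sample
GROUP BY State;
```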
unpivot¶

Required arguments
- Column: A list of target columns to be converted into values in new columns
- GroupEvery: Number of columns (defaults to 1)
Description
- Two columns are created: one contains the selected column names and the other contains their values (if GroupEvery is set to 1).
- If GroupEvery is the same as the number of selected columns, each resulting pair of columns contains the name and values of its respective original column. Therefore, if 10 columns are unpivoted with the GroupEvery argument set to 10, for example, a total of 20 columns are created.
Notes
Using the GroupEvery argument set to a factor of the number of columns will soon be supported.
<Where GroupEvery is set to 1>
![]()
<Where GroupEvery is set to the same as the number of columns>
![]()
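With GroupEvery set to 1, the result can be sketched in SQL as stacking name/value pairs; the table and columns are assumptions for illustration.

```sql
-- unpivot of the Sales and Profit columns with GroupEvery = 1 (illustrative):
SELECT State, 'Sales'  AS key, Sales  AS value FROM sales_sample
UNION ALL
SELECT State, 'Profit' AS key, Profit AS value FROM sales_sample;
```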
join¶

Unlike other rules, join has a separate popup.
Required arguments (select in a popup or enter a value)
- Dataset to join: A wrangled dataset in the same dataflow
- Columns to join (toggle)
- Join keys: Multiple values may be entered
- Join type: Only inner join supported now
Description
- Joins to the target dataset to create new columns.
- This rule is the same as `join` used by a relational database.
- The results can be previewed by clicking the Show result button.
Notes
The join keys must be included in the columns to join.
![]()
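Because only inner joins are supported, the operation corresponds to the familiar SQL below; the two tables and the join key are assumptions for illustration.

```sql
-- join of two datasets on CustomerID, inner join only (illustrative):
SELECT o.*, c.Segment
FROM orders o
INNER JOIN customers c
  ON o.CustomerID = c.CustomerID;
```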
union¶

Similar to join, union has a separate popup.
Required arguments (select in a popup)
- Datasets to union: Multiple selections allowed.
Description
- The rows of the selected datasets are appended to the current dataset.
- This rule is the same as `union all` used by a relational database.
Notes
The target datasets must match the dataset that unions them in terms of column names, types, and number of columns.
![]()
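Since the rule behaves like union all, the result can be sketched as below; the two yearly tables are assumptions for illustration.

```sql
-- union of two datasets with identical column names, types, and counts (illustrative):
SELECT * FROM sales_2017
UNION ALL
SELECT * FROM sales_2018;
```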
window¶


Required arguments
- Expression: A list of window functions
- Group by: A list of columns to group values by. Row order is determined within each group. If not specified, the whole dataset is ordered according to the Sort by setting.
- Sort by: Specifies columns by which the order of rows is determined. If not specified, data keeps the order in which it was inputted.
Description
Column values are created by calculating with the values of the preceding and following rows.
The rows are grouped first and then sorted within each group in the specified column order.
- In the above example, each row value is averaged with the three preceding and following rows within the same State group.
- If an immediately preceding row does not have the same state, earlier rows are searched.
The currently available window functions are as follows:
- row_number()
- lead(colname, int)
- lag(colname, int)
- rolling_sum(colname, int, int)
- rolling_avg(colname, int, int)
In addition to window functions, aggregate functions may be used.
Notes
- When using window functions, error messages may not be properly displayed in the event of insufficient arguments.
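The rolling-average example described above can be sketched with standard SQL window functions; the table and columns are assumptions for illustration.

```sql
-- window with the expression rolling_avg(Sales, 3, 3), grouped by State
-- and sorted by OrderDate (illustrative SQL equivalent):
SELECT State, OrderDate, Sales,
       ROW_NUMBER() OVER (PARTITION BY State ORDER BY OrderDate) AS row_num,
       AVG(Sales)   OVER (PARTITION BY State ORDER BY OrderDate
                          ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS rolling_avg_sales
FROM sales_sample;
```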
Create a data snapshot¶
When rule editing is complete, you can create a data snapshot of the finalized dataset, which can then be downloaded to your local PC or ingested into the Metatron engine. Running a data snapshot applies the rules to the entire data, whereas during rule editing they were applied only to a sample of fewer than 10,000 rows.
Below are instructions on creating a snapshot:
Click the Data Snapshot button on the upper right of the Edit rules window.
![]()
When a popup is displayed to set snapshot options, select either FileSystem or HIVE (STAGING_DB) under Snapshot type.
![]()
If FileSystem is selected as the snapshot location, the snapshot will be created as a CSV or JSON file.
![]()
The HIVE option is available only when STAGING_DB is enabled. A snapshot is created in the table when you designate a schema name and table name.
![]()
When the snapshot is created, you can view the snapshot status and related information in the same window.
![]()
Use data snapshot results¶
A data snapshot created through a dataflow can be used as follows:
Check the data snapshot results¶
The status of snapshot creation can be classified as follows:
- Success = SUCCEEDED
- Failed = FAILED
- Preparing = INITIALIZING, RUNNING, WRITING, TABLE_CREATING, CANCELING
You can view the details of snapshot creation through the two paths below:
Go to the snapshot list under MANAGEMENT > Data Preparation > Data Snapshot.
![]()
Click the Snapshot (#) tab on the right of the Edit rules page in Dataflow.
![]()
In the snapshot details page, you can view details such as data validity ratio and a grid of the created snapshot, and download the results as a CSV file (Download as CSV).
If valid data has not been created, the snapshot details page displays an error log.
Ingest into the Metatron engine¶
(upcoming feature)
Download as a CSV file¶
In the details page of a successfully created snapshot, the Download as CSV option is enabled.
The downloaded file is a standard CSV, with each value separated by a “comma” and each row by a “new line.”