Metatron powered by Druid

As explained previously, Metatron uses Druid as its underlying engine and has extended and improved Druid for its own needs. This section introduces the background, process, and results of adopting Druid in Metatron.

Metatron development background and Druid integration

Metatron as a big data analytics solution

As the telecommunications service provider with the largest number of subscribers in South Korea, SK Telecom has made significant efforts to establish a stable network environment by using the massive volume of network data logs generated by its users.

Due to the limitations of existing IT infrastructure in mass data processing, SK Telecom needed a big-data warehousing system (Apache Hadoop) and a big-data analytics solution compatible with the system. The company built its own Hadoop infrastructure to store mass amounts of data at low cost, but faced the following limitations:

Network data generated by the countless users could not be analyzed in real time. Although it was possible to store and process big data, visualizations could be implemented only with a sampled subset of data in the same way as on legacy systems.

Having different solutions and different managers support each stage of data analytics, such as ETL, DW, and BI, not only involved significant time and costs, but also resulted in poor data accessibility. An end-to-end solution was needed to analyze all stages at once in a simple and quick manner.

Why the Druid engine

Druid was the optimal engine for the Metatron solution because it fulfilled the aforementioned needs with the features below:

Druid collects massive amounts of data in real time and indexes it into a queryable format, enabling very fast data aggregations (a few seconds at the slowest) through distributed processing.

Druid’s OLAP time-series data format enables analysts to explore, filter, and visualize data as desired. Such free and flexible exploration is essential for users to intuitively select the data they need and determine correlations between its dimensions.

Druid’s extensible architecture allows modules to be easily added.

Built on this architecture, Metatron is an end-to-end solution that embraces all layers of data collection, storage, processing, analysis, and visualization.

Druid engine integration

The Druid engine was integrated in Metatron as follows:

With Druid as the basic engine for processing/analytics, the GUI was designed to support users in different professional domains and big-data analysts in data-related tasks such as data preparation, analytics, and visualization, as well as the sharing of results.

IT administrators can manage/monitor data sources in Druid, and they can establish data preparation rules if data sources of higher quality are required.

Druid functions reinforced in Metatron

The open-source Druid, despite its strengths in data collection and processing, had to be improved for Metatron to properly function as an end-to-end solution. This section examines the limitations of the open-source Druid and the functions reinforced in Metatron.

Limitations of the open-source Druid

The open-source Druid has the following limitations:

Since Druid does not yet have full support for joins, Metatron uses another SQL engine for data preparation.

Druid supports only a subset of SQL queries.

For a data lake, a traditional SQL engine is more appropriate.

Druid cannot append to or update already indexed segments, except in some unusual cases.

Nulls are not allowed.

Filtering is not supported for metric columns.

Linear scalability is not ensured. Increasing the number of servers does not improve performance proportionally.

Only a few data types are supported and it is difficult to add a new one.

The management and monitoring tools are not powerful enough.

Druid functions reinforced in Metatron

The following functions of Druid were strengthened in Metatron:

Query functionality improvements

Improved the functionality of the GroupBy query type.

Slightly improved the functionality of other types of queries.
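For readers unfamiliar with the GroupBy query type mentioned above, the following is a minimal sketch of what a native groupBy query body looks like in open-source Druid, built as a Python dictionary. The data source name `network_logs`, the dimension `cell_id`, and the metric `bytes` are hypothetical and chosen only for illustration. The Metatron-specific extensions to GroupBy are not shown here, since this section does not describe their syntax.

```python
import json


def build_group_by_query(data_source, dimensions, metric, interval,
                         granularity="hour"):
    """Build a native Druid groupBy query body as a plain dict.

    All argument values are supplied by the caller; nothing here is
    specific to Metatron.
    """
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": granularity,
        "dimensions": list(dimensions),
        "aggregations": [
            # Sum the metric column within each group.
            {"type": "longSum", "name": f"sum_{metric}", "fieldName": metric},
            # Count the number of (rolled-up) rows in each group.
            {"type": "count", "name": "rows"},
        ],
        # ISO-8601 interval restricting the time range that is scanned.
        "intervals": [interval],
    }


# Hypothetical data source and column names, for illustration only.
query = build_group_by_query(
    data_source="network_logs",
    dimensions=["cell_id"],
    metric="bytes",
    interval="2020-01-01T00:00:00Z/2020-01-02T00:00:00Z",
)
body = json.dumps(query, indent=2)
print(body)
```

In open-source Druid, a JSON body of this shape is typically sent to a Broker over HTTP. Whether a Metatron deployment exposes the same endpoint unchanged is an assumption that should be checked against its own documentation.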

Features added

Virtual columns (map, expression, etc.)

New metric types (double, string, array, etc.)

New expression functions

Druid query results can be stored on HDFS or exported to a file.

Queries for meta information and statistics

New aggregate functions (variance, correlation, etc.)

(Limited) Window functions (lead, lag, running aggregations, etc.)

(Limited) Joins

(Limited) Sub-queries

Temporary data sources

Complex queries (data source summarization, correlation between data sources, k-means, etc.)

Custom column grouping

Geographic information system (GIS) support

Columnar histograms

Bit-slice indexing
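Bit-slice indexing, the last item in the list above, is a general technique for answering range filters on numeric columns with bitmap operations instead of scanning every value. The sketch below illustrates the idea in plain Python, using integers as bitmaps. It is a generic illustration under simplifying assumptions (non-negative integer values, a single in-memory column) and is not taken from the Metatron implementation. The class and method names are hypothetical.

```python
class BitSliceIndex:
    """Bit-sliced index over non-negative integers.

    Each slice is a Python int used as a bitmap: bit r of slices[i]
    is set when bit i of the value in row r is 1.
    """

    def __init__(self, values):
        self.n_rows = len(values)
        self.n_bits = max(max(values, default=0).bit_length(), 1)
        self.slices = [0] * self.n_bits
        for row, value in enumerate(values):
            if value < 0:
                raise ValueError("only non-negative integers are supported")
            for i in range(self.n_bits):
                if (value >> i) & 1:
                    self.slices[i] |= 1 << row

    def rows_less_equal(self, threshold):
        """Return the sorted row ids whose value is <= threshold."""
        if threshold < 0:
            return []
        if threshold.bit_length() > self.n_bits:
            # Threshold exceeds every representable value in the index.
            return list(range(self.n_rows))
        all_rows = (1 << self.n_rows) - 1
        less = 0          # rows already known to be strictly smaller
        equal = all_rows  # rows matching the threshold on all higher bits
        for i in reversed(range(self.n_bits)):
            if (threshold >> i) & 1:
                # Threshold bit is 1: rows with a 0 here are strictly smaller.
                less |= equal & ~self.slices[i]
                equal &= self.slices[i]
            else:
                # Threshold bit is 0: rows with a 1 here are larger; drop them.
                equal &= ~self.slices[i]
        result = (less | equal) & all_rows
        return [r for r in range(self.n_rows) if (result >> r) & 1]


# Hypothetical metric column, for illustration only.
index = BitSliceIndex([5, 12, 7, 0, 9])
print(index.rows_less_equal(7))
```

The filter touches one bitmap per bit of the value width rather than one comparison per row, which is why this kind of index suits range filters on metric columns.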

Index structure improvements

Histograms for filtering on metrics

Lucene format supported for text filtering

Connectability with other systems

Hive storage handler

Ingestion into Hive tables (based on connection with the Hive metastore)

Ingestion into the ORC format

RDBMS data ingestion via JDBC

(Limited) SQL support backported

Miscellaneous improvements

Bug fixes (50+) and minor improvements