
Releases: apache/druid

Druid 25.0.0

04 Jan 07:43

Apache Druid 25.0.0 contains over 300 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors.

See the complete set of changes for additional details.

# Highlights

# MSQ task engine now production ready

The multi-stage query (MSQ) task engine used for SQL-based ingestion is now production ready. Use it for any supported workloads. For more information, see the following pages:

# Simplified Druid deployments

The new start-druid script greatly simplifies deploying any combination of Druid services on a single server. It comes pre-packaged with the required configs and can be used to launch a fully functional Druid cluster simply by invoking ./start-druid. For experienced Druids, it also gives complete control over the runtime properties and JVM arguments so that the cluster exactly fits your needs.

The start-druid script deprecates the existing profiles such as start-micro-quickstart and start-nano-quickstart. These profiles may be removed in future releases. For more information, see Single server deployment.

# String dictionary compression (experimental)

Added support for front coded string dictionaries for smaller string columns, leading to reduced segment sizes with only minor performance penalties for most Druid queries.

This can be enabled by setting IndexSpec.stringDictionaryEncoding to {"type":"frontCoded", "bucketSize": 4}, where bucketSize is any power of 2 less than or equal to 128. Setting this property instructs indexing tasks to write segments using compressed dictionaries of the specified bucket size.

Any segment written using string dictionary compression is not readable by older versions of Druid.

For more information, see Front coding.
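
As a rough illustration, the setting can be built and validated before it is placed in an ingestion spec (indexSpec typically sits in the tuningConfig). The helper name `front_coded_index_spec` is hypothetical; only the `stringDictionaryEncoding` key, the `frontCoded` type, and the `bucketSize` constraint come from the note above.

```python
import json


def front_coded_index_spec(bucket_size: int = 4) -> dict:
    """Build an indexSpec fragment that enables front-coded string dictionaries.

    Hypothetical helper: it only enforces the rule stated in the release note,
    that bucketSize is a power of 2 no larger than 128.
    """
    is_power_of_two = bucket_size > 0 and (bucket_size & (bucket_size - 1)) == 0
    if not is_power_of_two or bucket_size > 128:
        raise ValueError("bucketSize must be a power of 2 <= 128")
    return {
        "stringDictionaryEncoding": {"type": "frontCoded", "bucketSize": bucket_size}
    }


index_spec = front_coded_index_spec(4)
print(json.dumps(index_spec))
```

Because segments written this way are not readable by older Druid versions, it may be safer to try the setting on a test datasource first.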

#12277

# Kubernetes-native tasks

Druid can now use Kubernetes to launch and manage tasks, eliminating the need for middle managers.

To use this feature, enable the druid-kubernetes-overlord-extensions in the extensions load list for your Overlord process.

#13156

# Hadoop-3 compatible binary

Druid now comes packaged as a dedicated binary for Hadoop-3 users, which contains Hadoop-3 compatible jars. If you do not use Hadoop-3 with your Druid cluster, you may continue using the classic binary.

# Multi-stage query (MSQ) task engine

# MSQ enabled for Docker

The MSQ task engine is now enabled for Docker by default.

#13069

# Query history

Multi-stage queries no longer show up in the Query history dialog. They are still available in the Recent query tasks panel.

# Limit on CLUSTERED BY columns

When using the MSQ task engine to ingest data, the number of columns that can be passed in the CLUSTERED BY clause is now limited to 1500.

#13352

# Support for string dictionary compression

The MSQ task engine supports front coding of string dictionaries for better compression. This can be enabled for INSERT or REPLACE statements by setting indexSpec to a valid JSON string in the query context.
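
A minimal sketch of such a request body follows. The table names and SQL text are placeholders, and the payload shape (a `query` string plus a `context` object) is an assumption about how the statement is submitted; only the `indexSpec` context key and its JSON-string value come from the note above.

```python
import json

# Assumption: the datasource names and SQL text are placeholders for illustration.
sql = (
    'REPLACE INTO "example_table" OVERWRITE ALL '
    'SELECT * FROM "example_source" PARTITIONED BY DAY'
)

index_spec = {"stringDictionaryEncoding": {"type": "frontCoded", "bucketSize": 4}}

payload = {
    "query": sql,
    "context": {
        # The release note says indexSpec is passed as a JSON string.
        "indexSpec": json.dumps(index_spec),
    },
}

print(json.dumps(payload, indent=2))
```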

#13275

# Sketch merging mode

Workers can now gather key statistics, used to generate partition boundaries, either sequentially or in parallel. Set clusterStatisticsMergeMode to PARALLEL, SEQUENTIAL or AUTO in the query context to use the corresponding sketch merging mode. For more information, see Sketch merging mode.
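
For illustration, the context entry could be assembled as shown here. The helper name `merge_mode_context` is hypothetical; the parameter name and its three values are taken from the note above.

```python
import json

ALLOWED_MERGE_MODES = {"PARALLEL", "SEQUENTIAL", "AUTO"}


def merge_mode_context(mode: str) -> dict:
    """Return a query context dict that selects the sketch merging mode."""
    if mode not in ALLOWED_MERGE_MODES:
        raise ValueError(f"unsupported clusterStatisticsMergeMode: {mode}")
    return {"clusterStatisticsMergeMode": mode}


context = merge_mode_context("AUTO")
print(json.dumps(context))
```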

#13205

# Performance and operational improvements

  • Error messages: For disallowed MSQ warnings of certain types, the warning is now surfaced as an error. #13198
  • Secrets: For tasks containing SQL with sensitive keys, Druid now masks the keys while logging with the help of regular expressions. #13231
  • Downsampling accuracy: MSQ task engine now uses the number of bytes instead of number of keys when downsampling data. #12998
  • Memory usage: When determining partition boundaries, the heap footprint of internal sketches used by MSQ is now capped at 10% of available memory or 300 MB, whichever is lower. Previously, the cap was strictly 300 MB. #13274
  • Task reports: Added fields pendingTasks and runningTasks to the worker report. See Query task status information for related web console changes. #13263
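
The memory cap described in the list above amounts to a simple minimum. The sketch below is only arithmetic, not the engine's code; the function name is hypothetical and "300 MB" is assumed to mean 300,000,000 bytes.

```python
# Assumption: "300 MB" is read as 300 * 1000 * 1000 bytes; the engine may use another unit.
SKETCH_CAP_BYTES = 300_000_000


def sketch_heap_cap(available_memory_bytes: int) -> int:
    """Return the lower of 10% of available memory and the fixed 300 MB cap."""
    return min(available_memory_bytes // 10, SKETCH_CAP_BYTES)


small_heap = sketch_heap_cap(1_000_000_000)  # 10% is below the fixed cap
large_heap = sketch_heap_cap(8_000_000_000)  # the fixed cap applies
print(small_heap, large_heap)
```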

# Querying

# Async reads for JDBC

Prevented JDBC timeouts on long queries by returning empty batches when a batch fetch takes too long. Uses an async model to run the result fetch concurrently with JDBC requests.

#13196

# Improved algorithm to check values of an IN filter

To accommodate large value sets arising from large IN filters or from joins pushed down as IN filters, Druid now uses a sorted merge algorithm for merging the set and dictionary for larger values.
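
The general idea of a sorted merge can be sketched as a two-pointer walk over two sorted sequences. This is an illustration of the technique only, not Druid's implementation; the function name and sample data are made up.

```python
def sorted_merge_matches(sorted_values, sorted_dictionary):
    """Return dictionary positions whose entries appear in the IN-filter value set.

    Both inputs must be sorted in the same order and free of duplicates.
    """
    matches = []
    i = j = 0
    while i < len(sorted_values) and j < len(sorted_dictionary):
        if sorted_values[i] == sorted_dictionary[j]:
            matches.append(j)
            i += 1
            j += 1
        elif sorted_values[i] < sorted_dictionary[j]:
            i += 1  # this filter value is not in the dictionary
        else:
            j += 1  # this dictionary entry is not in the value set
    return matches


dictionary = ["apple", "banana", "cherry", "kiwi", "mango"]
in_values = sorted({"kiwi", "banana", "zucchini"})
print(sorted_merge_matches(in_values, dictionary))
```

Each input is scanned once, so the cost grows with the combined length of the two sequences rather than with one lookup per value.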

#13133

# Enhanced query context security

Added the following configuration properties that refine the query context security model controlled by druid.auth.authorizeQueryContextParams:

  • druid.auth.unsecuredContextKeys: A JSON list of query context keys that do not require a security check.
  • druid.auth.securedContextKeys: A JSON list of query context keys that do require a security check.

If both are set, unsecuredContextKeys acts as exceptions to securedContextKeys.
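
One way to read the interaction of the two lists is sketched below. This is an interpretation of the description above, not Druid's code: it assumes druid.auth.authorizeQueryContextParams is enabled, that every key needs a check when neither list is set, and it ignores any keys Druid may exempt internally. The function name is hypothetical.

```python
def keys_requiring_check(user_keys, secured_keys=None, unsecured_keys=None):
    """Return the user-supplied context keys that need a security check."""
    secured = set(secured_keys or [])
    unsecured = set(unsecured_keys or [])
    # Keys listed as unsecured never need a check.
    to_check = set(user_keys) - unsecured
    # Assumed: when a secured list is configured, only keys on it are checked.
    if secured:
        to_check &= secured
    return to_check


user_keys = {"priority", "timeout", "maxSubqueryRows"}
both_set = keys_requiring_check(
    user_keys, secured_keys=["priority", "timeout"], unsecured_keys=["timeout"]
)
print(sorted(both_set))
```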

#13071

# HTTP response headers

The HTTP response for a SQL query now correctly sets response headers, the same as for a native query.

#13052

# Metrics

# New metrics

The following metrics have been newly added. For more details, see the complete list of Druid metrics.

# Batched segment allocation

These metrics pertain to batched segment allocation.

| Metric | Description | Dimensions |
|---|---|---|
| task/action/batch/runTime | Milliseconds taken to execute a batch of task actions. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
| task/action/batch/queueTime | Milliseconds spent by a batch of task actions in queue. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
| task/action/batch/size | Number of task actions in a batch that was executed during the emission period. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
| task/action/batch/attempts | Number of execution attempts for a single batch of task actions. Currently only being emitted for batched segmentAllocate actions | dataSource, taskActionType=segmentAllocate |
| task/action/success/count | Number of task actions that were executed successfully during the emission period. Currently only being emitted for batched segmentAllocate actions | dataSource, taskId, taskType, taskActionType=segmentAllocate |
| task/action/failed/count | Number of task actions that failed during the emission period. Currently only being emitted for batched segmentAllocate actions | dataSource, taskId, `tas... |

Druid 24.0.2

22 Dec 02:12

Apache Druid 24.0.2 is a bug fix release that fixes some issues in the 24.0.1 release.
See the complete set of changes for additional details.

# Bug fixes

#13138 to fix dependency errors while launching a Hadoop task.

# Credits

@kfaraz
@LakshSingla

Druid 24.0.1

22 Nov 18:23

Apache Druid 24.0.1 is a bug fix release that fixes some issues in the 24.0 release.
See the complete set of changes for additional details.

# Notable Bug fixes

#13214 to fix SQL planning when using the JSON_VALUE function.
#13297 to fix values that match a range filter on nested columns.
#13077 to fix detection of nested objects while generating an MSQ SQL in the web-console.
#13172 to correctly handle overlord leader election even when tasks cannot be reacquired.
#13259 to fix memory leaks from SQL statement objects.
#13273 to fix overlord API failures by de-duplicating task entries in memory.
#13049 to fix a race condition while processing query context.
#13151 to fix assertion error in SQL planning.

# Credits

Thanks to everyone who contributed to this release!

@abhishekagarwal87
@AmatyaAvadhanula
@clintropolis
@gianm
@kfaraz
@LakshSingla
@paul-rogers
@vogievetsky

# Known issues

  • Hadoop ingestion does not work with custom extension config due to injection errors
    (fixed in #13138)

Druid 24.0.0

16 Sep 14:48

Apache Druid 24.0.0 contains over 300 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 67 contributors. See the complete set of changes for additional details.

# Major version bump

Starting with this release, we have dropped the leading 0 from the release version and promoted all other digits one place to the left. Druid is now at major version 24, a jump up from the prior 0.23.0 release. In terms of backward-compatibility or breaking changes, this release is not significantly different than other previous major releases such as 0.23.0 or 0.22.0. We are continuing with the same policy as we have used in prior releases: minimizing the number of changes that require special attention when upgrading, and calling out any that do exist in the release notes. For this release, please refer to the Upgrading to 24.0.0 section for a list of backward-incompatible changes in this release.

# New Features

# Multi-stage query task engine

SQL-based ingestion for Apache Druid uses a distributed multi-stage query architecture, which includes a query engine called the multi-stage query task engine (MSQ task engine). The MSQ task engine extends Druid's query capabilities, so you can write queries that reference external data as well as perform ingestion with SQL INSERT and REPLACE. Essentially, you can perform SQL-based ingestion instead of using JSON ingestion specs that Druid's native ingestion uses. In addition to the easy-to-use syntax, the SQL interface lets you perform transformations that involve multiple shuffles of data.

SQL-based ingestion using the multi-stage query task engine is recommended for batch ingestion starting in Druid 24.0.0. Native batch and Hadoop-based ingestion continue to be supported as well. We recommend you review the known issues and test the feature in a staging environment before rolling out in production. Using the multi-stage query task engine with plain SELECT statements (not INSERT ... SELECT or REPLACE ... SELECT) is experimental.

If you're upgrading from an earlier version of Druid or you're using Docker, you'll need to add the druid-multi-stage-query extension to druid.extensions.loadList in your common.runtime.properties file.

For more information, refer to the Overview documentation for SQL-based ingestion.

#12524
#12386
#12523
#12589

# Nested columns

Druid now supports directly storing nested data structures in a newly added COMPLEX<json> column type. COMPLEX<json> columns store a copy of the structured data in JSON format as well as specialized internal columns and indexes for nested literal values—STRING, LONG, and DOUBLE types. An optimized virtual column allows Druid to read and filter these values at speeds consistent with standard Druid LONG, DOUBLE, and STRING columns.

Newly added Druid SQL functions, native JSON functions, and virtual columns allow you to extract, transform, and create COMPLEX<json> values at query time. You can also use the JSON functions in INSERT and REPLACE statements in SQL-based ingestion, or in a transformSpec in native ingestion as an alternative to using a flattenSpec object to "flatten" nested data for ingestion.

See SQL JSON functions, native JSON functions, Nested columns, virtual columns, and the feature summary for more detail.

#12753
#12714
#12753
#12920

# Updated Java support

Java 11 is fully supported and is no longer experimental. Java 17 support is improved.

#12839

# Query engine updates

# Updated column indexes and query processing of filters

Reworked column indexes to be extraordinarily flexible, which will eventually allow us to model a wide range of index types. Added machinery to build the filters that use the updated indexes, while also allowing other column implementations to implement the built-in index types and provide adapters that make use of indexing in the current set of filters that Druid provides.

#12388

# Time filter operator

You can now use the Druid SQL operator TIME_IN_INTERVAL to filter query results based on time. Prefer TIME_IN_INTERVAL over the SQL BETWEEN operator to filter on time. For more information, see Date and time functions.
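
As a hedged example, a time-filtered query might look like the following. The datasource name "wikipedia" and the interval are placeholders, and wrapping the SQL in a `query` field is an assumption about how the request is sent.

```python
import json

# Assumption: "wikipedia" is a placeholder datasource name.
sql = """
SELECT COUNT(*) AS row_count
FROM "wikipedia"
WHERE TIME_IN_INTERVAL(__time, '2022-06-01/2022-07-01')
"""

payload = {"query": sql.strip()}
print(json.dumps(payload, indent=2))
```

The second argument is an ISO 8601 interval written as a string literal, so the start and end of the range are stated once instead of as two separate comparisons.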

#12662

# Null values and the "in" filter

If a values array contains null, the "in" filter matches null values. This differs from the SQL IN filter, which does not match null values.
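
The difference can be modelled with two small predicates. This is a simplified illustration rather than Druid's code: the dimension name "country" is a placeholder and Python's `None` stands in for null.

```python
def native_in_matches(row_value, values):
    """Native "in" filter: a null row value matches when values contains null."""
    return row_value in values


def sql_in_matches(row_value, values):
    """SQL IN: a NULL row value never matches, even if NULL is listed."""
    return row_value is not None and row_value in values


# Native filter with a null entry in its values array.
in_filter = {"type": "in", "dimension": "country", "values": ["US", None]}

print(native_in_matches(None, in_filter["values"]))
print(sql_in_matches(None, in_filter["values"]))
```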

For more information, see Query filters and SQL data types.
#12863

# Virtual columns in search queries

Previously, a search query could only search on dimensions that existed in the data source. Search queries now support virtual columns as a parameter in the query.

#12720

# Optimize simple MIN / MAX SQL queries on __time

Simple queries like select max(__time) from ds now run as timeBoundary queries to take advantage of the time dimension sorting in a segment. You can set a feature flag to enable this feature.
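
For orientation, the SQL and a roughly equivalent native timeBoundary query are shown side by side below. The native query is a hand-written approximation; the query the planner actually generates may carry additional fields.

```python
import json

sql = "SELECT MAX(__time) FROM ds"

# Hand-written approximation of the native query for the SQL above.
time_boundary_query = {
    "queryType": "timeBoundary",
    "dataSource": "ds",
    "bound": "maxTime",  # use "minTime" for MIN(__time)
}

print(sql)
print(json.dumps(time_boundary_query, indent=2))
```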

#12472
#12491

# String aggregation results

The first/last string aggregator now only compares based on values. Previously, the first/last string aggregator’s values were compared based on the __time column first and then on values.

If you have existing queries and want to continue using both the __time column and values, update your queries to use ORDER BY MAX(timeCol).

#12773

# Reduced allocations due to Jackson serialization

Introduced and implemented new helper functions in JacksonUtils to enable reuse of SerializerProvider objects.

Additionally, disabled backwards compatibility for map-based rows in the GroupByQueryToolChest by default, which eliminates the need to copy the heavyweight ObjectMapper. Introduced a configuration option to allow administrators to explicitly enable backwards compatibility.

#12468

# Updated IPAddress Java library

Added a new IPAddress Java library dependency to handle IP addresses. The library includes IPv6 support. Additionally, migrated IPv4 functions to use the new library.

#11634

# Query performance improvements

Optimized SQL operations and functions as follows:

  • Vectorized numeric latest aggregators (#12439)
  • Optimized isEmpty() and equals() on RangeSets (#12477)
  • Optimized reuse of Yielder objects (#12475)
  • Operations on numeric columns with indexes are now faster (#12830)
  • Optimized GroupBy by reducing allocations. Reduced allocations by reusing entry and key holders (#12474)
  • Added a vectorized version of string last aggregator (#12493)
  • Added Direct UTF-8 access for IN filters (#12517)
  • Enabled virtual columns to cache their outputs in case Druid calls them multiple times on the same underlying row (#12577)
  • Druid now rewrites a join as a filter when possible in IN joins (#12225)
  • Added automatic sizing for GroupBy dictionaries (#12763)
  • Druid now distributes JDBC connections more evenly amongst brokers (#12817)

# Streaming ingestion

# Kafka consumers

Previously, consumers that were registered and used for ingestion persisted until Kafka deleted the...


Druid 0.23.0

23 Jun 03:31

Apache Druid 0.23.0 contains over 450 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 81 contributors. See the complete set of changes for additional details.

# New Features

# Query engine

# Grouping on arrays without exploding the arrays

You can now group on a multi-value dimension as an array. For a datasource named "test":

{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4

The following query:

{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "all"
  },
  "virtualColumns" : [ {
    "type" : "expression",
    "name" : "v0",
    "expression" : "mv_to_array(\"tags\")",
    "outputType" : "ARRAY<STRING>"
  } ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "v0",
      "outputName": "tags",
      "outputType":"ARRAY<STRING>"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}

Returns the following:

[
 {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "[]"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "[\"t1\",\"t2\",\"t3\"]"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "[\"t3\",\"t4\",\"t5\"]"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "[\"t5\",\"t6\",\"t7\"]"
    }
  }
]

(#12078)
(#12253)

# Specify a column other than __time column for row comparison in first/last aggregators

You can pass a time column to the *first/*last aggregators by using the LATEST_BY / EARLIEST_BY SQL functions. This supports cases where the time is stored in a column other than "__time". You can also specify another logical time column.
(#11949)
(#12145)
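
A short, hedged example of these functions follows. The table and column names are placeholders, and "reported_at" is assumed to be a timestamp-typed column.

```python
import json

# Assumption: "sensors", "temperature", and "reported_at" are placeholder names,
# and "reported_at" holds timestamps.
sql = """
SELECT
  "device_id",
  LATEST_BY("temperature", "reported_at") AS latest_temperature,
  EARLIEST_BY("temperature", "reported_at") AS first_temperature
FROM "sensors"
GROUP BY "device_id"
"""

payload = {"query": sql.strip()}
print(json.dumps(payload, indent=2))
```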

# Improvements to querying user experience

This release includes several improvements for querying:

  • Added the SQL query ID to response header for failed SQL query to aid in locating the error messages (#11756)
  • Added input type validation for DataSketches HLL (#12131)
  • Improved JDBC logging (#11676)
  • Added SQL functions MV_FILTER_ONLY and MV_FILTER_NONE to filter rows of multi-value string dimensions to include only the supplied list of values or none of them respectively (#11650)
  • Added ARRAY_CONCAT_AGG to aggregate array inputs together into a single array (#12226)
  • Added the ability to authorize the usage of query context parameters (#12396)
  • Improved query IDs to make it easier to link queries and sub-queries for end-to-end query visibility (#11809)
  • Added a safe divide function to protect against division by 0 (#11904)
  • You can now add a query context to internally generated SegmentMetadata query (#11429)
  • Added support for Druid complex types to the native expression processing system to make all Druid data usable within expressions (#11853, #12016)
  • You can control the size of the on-heap segment-level dictionary via druid.query.groupBy.maxSelectorDictionarySize when grouping on string or array-valued expressions that do not have pre-existing dictionaries.
  • You have better protection against filter explosion during CNF conversion (#12314) (#12324)
  • You can get the complete native query on explaining the SQL query by setting useNativeQueryExplain to true in query context (#11908)
  • You can have the Broker ignore real-time nodes or specific Historical tiers. (#11766) (#11732)

# Streaming Ingestion

# Kafka input format for parsing headers and key

We've introduced a Kafka input format so you can ingest header data in addition to the message contents. For example:

  • the event key field
  • event headers
  • the Kafka event timestamp
  • the Kafka event value that stores the payload.

(#11630)

# Kinesis ingestion - Improvements

We have made the following improvements to Kinesis ingestion:

  • Re-sharding can slow down ingestion because many intermediate empty shards are created. These shards get assigned to tasks, causing an imbalance in load assignment. You can set skipIgnorableShards to true in the Kinesis ingestion tuning config to ignore such shards. (#12235)
  • Currently, Kinesis ingestion uses DescribeStream to fetch the list of shards. This call is deprecated and slower. In this release, you can switch to the newer listShards API by setting useListShards to true in the Kinesis ingestion tuning config. (#12161)
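
Both flags from the list above live in the tuning config of a Kinesis supervisor spec. The fragment below is a sketch that shows only those two settings; a real spec needs further fields that are omitted here.

```python
import json

# Sketch of the tuningConfig portion only; other supervisor spec fields are omitted.
tuning_config = {
    "type": "kinesis",
    "skipIgnorableShards": True,  # ignore empty intermediate shards after re-sharding
    "useListShards": True,        # fetch shards with listShards instead of DescribeStream
}

print(json.dumps({"tuningConfig": tuning_config}, indent=2))
```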

# Native Batch Ingestion

# Multi-dimension range partitioning

Multi-dimension range partitioning allows users to partition their data on the ranges of any number of dimensions. It develops further on the concepts behind "single-dim" partitioning and is now arguably the most preferable secondary partitioning, both for query performance and storage efficiency.
(#11848)
(#11973)
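
A hedged sketch of a tuning config that selects range partitioning on two dimensions is shown below. The dimension names and the target row count are placeholders, and the surrounding ingestion spec is omitted.

```python
import json

# Assumption: "country" and "city" are placeholder dimension names.
partitions_spec = {
    "type": "range",
    "partitionDimensions": ["country", "city"],
    "targetRowsPerSegment": 5_000_000,
}

tuning_config = {
    "type": "index_parallel",
    # Range partitioning is a perfect-rollup mode, so guaranteed rollup is switched on.
    "forceGuaranteedRollup": True,
    "partitionsSpec": partitions_spec,
}

print(json.dumps(tuning_config, indent=2))
```

Listing the most frequently filtered dimension first tends to help most, since rows are ranged on the dimensions in the order given.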

# Improved replace data behavior

In previous versions of Druid, if you ingested data with the dropExisting flag to replace data, Druid would retain the existing data for a time chunk if there was no new data to replace it. Now, if you set dropExisting to true in your ioConfig and ingest data for a time range that includes a time chunk with no data, Druid uses a tombstone to overshadow the existing data in the empty time chunk.
(#12137)
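
The fragment below sketches where the flag sits in a native batch task. The input source path and the interval are placeholders, and the rest of the spec is omitted; the sketch assumes the replace applies to the interval named in the granularitySpec.

```python
import json

# Assumption: the local input source and the interval are placeholders.
io_config = {
    "type": "index_parallel",
    "inputSource": {"type": "local", "baseDir": "/tmp/example", "filter": "*.json"},
    "inputFormat": {"type": "json"},
    "appendToExisting": False,  # replacing data, not appending to it
    "dropExisting": True,       # empty time chunks in the interval get tombstones
}

granularity_spec = {
    "segmentGranularity": "day",
    "intervals": ["2022-01-01/2022-02-01"],  # the time range being replaced
}

print(json.dumps({"ioConfig": io_config, "granularitySpec": granularity_spec}, indent=2))
```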

This release includes several improvements for native batch ingestion:

  • Druid now emits a new metric when a batch task finishes waiting for segment availability. (#11090)
  • Added segmentAvailabilityWaitTimeMs, the duration in milliseconds that a task waited for its segments to be handed off to Historical nodes, to IngestionStatsAndErrorsTaskReportData (#11090)
  • Added functionality to preserve existing metrics during ingestion (#12185)
  • Parallel native batch task can now provide task reports for the sequential and single phase mode (e.g., used with dynamic partitioning) as well as single phase mode subtasks (#11688)
  • Added support for RowStats in druid/indexer/v1/task/{task_id}/reports API for multi-phase parallel indexing task (#12280)
  • Fixed the OOM failures in the dimension distribution phase of parallel indexing (#12331)
  • Added support to handle null dimension values while creating partition boundaries (#11973)

# Improvements to ingestion in general

This release includes several improvements for ingestion in general:

  • Removed the template modifier from IncrementalIndex<AggregatorType> because it is no longer required
  • You can now use JsonPath functions in JsonPath expressions during ingestion (#11722)
  • Druid no longer creates a materialized list of segment files and eliminated looping over the files to reduce OOM issues (#11903)
  • Added an intermediate-persist IndexSpec to the main "merge" method in IndexMerger (#11940)
  • Granularity.granularitiesFinerThan now returns ALL if you pass in ALL (#12003)
  • Added a configuration parameter for appending tasks to allow them to use a SHARED lock (#12041)
  • SchemaRegistryBasedAvroBytesDecoder now throws a ParseException instead of RE when it fails to retrieve a schema (#12080)
  • Added includeAllDimensions to dimensionsSpec to put all explicit dimensions first in InputRow and subsequently any other dimensions found in input data (#12276)
  • Added the ability to store null columns in segments (#12279)

# Compaction

This release includes several improvements for compaction:

  • Automatic compaction now supports complex dimensions (#11924)
  • Automatic compaction now supports overlapping segment in...

druid-0.22.1

11 Dec 09:24

Apache Druid 0.22.1 is a bug fix release that fixes some security issues. See the complete set of changes for additional details.

# Bug fixes

#12051 Update log4j to 2.15.0 to address CVE-2021-44228
#11787 JsonConfigurator no longer logs sensitive properties
#11786 Update axios to 0.21.4 to address CVE-2021-3749
#11844 Update netty4 to 4.1.68 to address CVE-2021-37136 and CVE-2021-37137

# Credits

Thanks to everyone who contributed to this release!

@abhishekagarwal87
@andreacyc
@clintropolis
@gianm
@jihoonson
@kfaraz
@xvrl

druid-0.22.0

22 Sep 22:24

Apache Druid 0.22.0 contains over 400 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 73 contributors. See the complete set of changes for additional details.

# New features

# Query engine

# Support for multiple distinct aggregators in same query

Druid can now compute multiple exact DISTINCT counts in the same query using the grouping aggregator typically used with grouping sets. This applies only to exact counts, that is, when druid.sql.planner.useApproximateCountDistinct is false. Enable it by setting druid.sql.planner.useGroupingSetForExactDistinct to true.

#11014

# SQL ARRAY_AGG and STRING_AGG aggregator functions

The ARRAY_AGG aggregation function has been added, to allow accumulating values or distinct values of a column into a single array result. This release also adds STRING_AGG, which is similar to ARRAY_AGG, except it joins the array values into a single string with a supplied 'delimiter' and it ignores null values. Both of these functions accept a maximum size parameter to control maximum result size, and will fail if this value is exceeded. See SQL documentation for additional details.
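
A hedged example that uses both functions follows. The datasource and column names are placeholders, and 4096 is an arbitrary value for the optional maximum size parameter.

```python
import json

# Assumption: "wikipedia", "channel", "user", and "page" are placeholder names.
sql = """
SELECT
  "channel",
  ARRAY_AGG(DISTINCT "user", 4096) AS users,
  STRING_AGG("page", ',', 4096) AS pages
FROM "wikipedia"
GROUP BY "channel"
"""

payload = {"query": sql.strip()}
print(json.dumps(payload, indent=2))
```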

#11157
#11241

# Bitwise math function expressions and aggregators

Several new SQL functions for performing 'bitwise' math (along with corresponding native expressions) have been added, including BITWISE_AND, BITWISE_OR, BITWISE_XOR and so on. Additionally, aggregation functions BIT_AND, BIT_OR, and BIT_XOR have been added to accumulate values in a column with the corresponding bitwise function. For complete details see SQL documentation.

#10605
#10823
#11280

# Human readable number format functions

Three new SQL and native expression number format functions have been added in Druid 0.22.0, HUMAN_READABLE_BINARY_BYTE_FORMAT, HUMAN_READABLE_DECIMAL_BYTE_FORMAT, and HUMAN_READABLE_DECIMAL_FORMAT, which allow transforming results into a more friendly consumption format for query results. For more information see SQL documentation.

#10584
#10635

# Expression aggregator

Druid 0.22.0 adds a new 'native' JSON query expression aggregator function, that lets you use Druid native expressions to perform "fold" (alternatively known as "reduce") operations to accumulate some value on any number of input columns. This adds significant flexibility to what can be done in a Druid aggregator, similar in a lot of ways to what was possible with the Javascript aggregator, but in a much safer, sandboxed manner.

Expressions now being able to perform a "fold" on input columns also really rounds out the abilities of native expressions in addition to the previously possible "map" (expression virtual columns), "filter" (expression filters) and post-transform (expression post-aggregators) functions.

Since this uses expressions, performance is not yet optimal, and it is not directly documented yet, but it is the underlying technology behind the SQL ARRAY_AGG, STRING_AGG, and bitwise aggregator functions also added in this release.
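
The "fold" idea itself can be illustrated outside Druid. The sketch below is not the aggregator's API, only the general pattern: start from an initial value and apply a two-argument function to the accumulator and each input in turn. The function name and sample data are made up.

```python
from functools import reduce


def fold_column(values, initial, fold_fn):
    """Accumulate a single result by applying fold_fn(acc, value) to each input."""
    return reduce(fold_fn, values, initial)


# A bitwise AND over a column, starting from all bits set.
bit_and = fold_column([0b1110, 0b0111, 0b0110], ~0, lambda acc, x: acc & x)

# A plain sum.
total = fold_column([1, 2, 3], 0, lambda acc, x: acc + x)

# Collecting values into a list while skipping nulls.
collected = fold_column(
    ["a", None, "b"], [], lambda acc, x: acc if x is None else acc + [x]
)

print(bit_and, total, collected)
```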

#11104

# SQL query routing improvements

Druid 0.22 adds some new facilities to provide extension writers with enhanced control over how queries are routed between Druid routers and brokers. The first adds a new manual broker selection strategy to the Druid router, which allows a query to manually specify which Druid brokers a query should be sent to based on a query context parameter brokerService to any broker pool defined in druid.router.tierToBrokerMap (this corresponds to the 'service name' of the broker set, druid.service).
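
As a hedged sketch, a native query could carry the context parameter as shown below. The service name "druid/broker-hot", the datasource, and the interval are placeholders; the named service would need to appear in druid.router.tierToBrokerMap and the router would need the manual strategy configured.

```python
import json

# Assumption: "druid/broker-hot" is a hypothetical broker service name.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "intervals": ["2021-09-01/2021-09-02"],
    "granularity": "all",
    "aggregations": [{"type": "count", "name": "rows"}],
    "context": {"brokerService": "druid/broker-hot"},
}

print(json.dumps(query, indent=2))
```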

The second new feature allows the Druid router to parse and examine SQL queries so that broker selection strategies can also function for SQL queries. This can be enabled by setting druid.router.sql.enable to true. This does not affect JDBC queries, which use a different mechanism to facilitate "sticky" connections to a single broker.

#11566
#11495

# Avatica protobuf JDBC Support

Druid now supports using Avatica Protobuf JDBC connections, such as for use with the Avatica Golang Driver, and has a separate endpoint from the JSON JDBC uri.

String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica-protobuf/;serialization=protobuf";

#10543

# Improved query error logging

Query exceptions have been changed from WARN level to ERROR level to include additional information in the logs to help troubleshoot query failures. Additionally, a new query context flag, enableQueryDebugging, has been added that includes stack traces in these query error logs, providing even more information without the need to enable logs at the DEBUG level.

#11519

# Streaming Ingestion

# Task autoscaling for Kafka and Kinesis streaming ingestion

Druid 0.22.0 now offers experimental support for dynamic Kafka and Kinesis task scaling. The included strategies are driven by periodic measurement of stream lag (which is based on message count for Kafka, and difference of age between the message iterator and the oldest message for Kinesis), and will adjust the number of tasks based on the amount of 'lag' and several configuration parameters. See Kafka and Kinesis documentation for complete information.

#10524
#10985

# Avro and Protobuf streaming InputFormat and Confluent Schema Registry Support

Druid streaming ingestion now has support for Avro and Protobuf in the updated InputFormat specification format, which replaces the deprecated firehose/parser specification used by legacy Druid streaming formats. Alongside this, comes support for obtaining schemas for these formats from Confluent Schema Registry. See data formats documentation for further information.

#11040
#11018
#10314
#10839

# Kafka ingestion support for specifying group.id

Druid Kafka streaming ingestion now optionally supports specifying group.id on the connections Druid tasks make to the Kafka brokers. This is useful for accessing clusters which require this be set as part of authorization, and can be specified in the consumerProperties section of the Kafka supervisor spec. See Kafka ingestion documentation for more details.
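
For illustration, the consumerProperties section might look like the following. The broker address and the group name are placeholders, and the rest of the supervisor spec is omitted.

```python
import json

# Assumption: the bootstrap server and group id values are placeholders.
consumer_properties = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "druid-example-group",
}

print(json.dumps({"consumerProperties": consumer_properties}, indent=2))
```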

#11147

# Native Batch Ingestion

# Support for using deep storage for intermediary shuffle data

Druid native 'perfect rollup' 2-phase ingestion tasks now support using deep storage as a shuffle location, as an alternative to local disks on middle-managers or indexers. To use this feature, set druid.processing.intermediaryData.storage.type to deepstore, which uses the configured deep storage type.

Note: With the "deepstore" type, data is stored in the shuffle-data directory under the configured deep storage path. Automatic cleanup of this directory is not supported yet. You can set up cloud storage lifecycle rules to clean up data at the shuffle-data prefix location.

#11507

# Improved native batch ingestion task memory usage

Druid native batch ingestion has received a new configuration option, druid.indexer.task.batchProcessingMode which introduces two new operating modes that should allow batch ingestion to operate with a smaller and more predictable heap memory usage footprint. The CLOSED_SEGMENTS_SINKS mode is the most aggressive, and should have the smallest memory footprint, and works by eliminating in memory tracking and mmap of intermediary segments produced during segment creation, but isn't super well tested at this point so considered experimental...


druid-0.21.1

10 Jun 23:14

Apache Druid 0.21.1 is a bug fix release that fixes a few regressions in the 0.21 release. The first is an issue with the published Docker image, which causes containers to fail to start due to volume permission issues, described in #11166 and fixed in #11167. This release also fixes an issue caused by a bug in the upgraded Jetty version which was released in 0.21, described in #11206 and fixed in #11207. Finally, a fix for a web console regression related to field validation has been added in #11228.

# Bug fixes

#11167 fix docker volume permissions
#11207 Upgrade jetty version
#11228 Web console: Fix required field treatment
#11299 Fix permission problems in docker

# Credits

Thanks to everyone who contributed to this release!

@a2l007
@clintropolis
@FrankChen021
@maytasm
@vogievetsky

druid-0.21.0

28 Apr 00:26

Apache Druid 0.21.0 contains around 120 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

# New features

# Operation

# Service discovery and leader election based on Kubernetes

The new Kubernetes extension supports service discovery and leader election based on Kubernetes. This extension works in conjunction with the HTTP-based server view (druid.serverview.type=http) and task management (druid.indexer.runner.type=httpRemote) to allow you to run a Druid cluster with zero ZooKeeper dependencies. This extension is still experimental. See Kubernetes extension for more details.

#10544
#9507
#10537

# New dynamic coordinator configuration to limit the number of segments when finding a candidate segment for segment balancing

You can set the percentOfSegmentsToConsiderPerMove to limit the number of segments considered when picking a candidate segment to move. The candidates are searched up to maxSegmentsToMove * 2 times. This new configuration prevents Druid from iterating through all available segments to speed up the segment balancing process, especially if you have lots of available segments in your cluster. See Coordinator dynamic configuration for more details.
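As a rough illustration, the setting can be included in the JSON body submitted as Coordinator dynamic configuration. The sketch below only builds that body. The values 25 and 5 are arbitrary examples, and the segments_considered helper is a hypothetical approximation of the resulting limit, not Druid's actual code.

```python
import json
import math

# Example dynamic configuration values (arbitrary, for illustration only).
dynamic_config = {
    "maxSegmentsToMove": 5,
    "percentOfSegmentsToConsiderPerMove": 25,
}


def segments_considered(total_segments: int, percent: float) -> int:
    """Hypothetical approximation of how many segments are examined per move."""
    return math.ceil(total_segments * percent / 100)


# JSON body for the dynamic configuration request.
payload = json.dumps(dynamic_config)
print(payload)
print(segments_considered(10_000, dynamic_config["percentOfSegmentsToConsiderPerMove"]))
```

Under this approximation, a lower percentage shrinks the pool the Coordinator has to scan, at the cost of possibly missing a better candidate outside that pool.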

#10284

# status and selfDiscovered endpoints for Indexers

The Indexer now supports status and selfDiscovered endpoints. See Processor information APIs for details.

#10679

# Querying

# New grouping aggregator function

You can use the new grouping aggregator SQL function with GROUPING SETS or CUBE to indicate which grouping dimensions are included in the current grouping set. See Aggregation functions for more details.
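The sketch below models how the function's result can be read. It assumes standard SQL GROUPING semantics, where each argument maps to one bit and a bit is 1 when that dimension is not part of the current grouping set. The datasource wikipedia and the columns channel and page are hypothetical.

```python
def grouping_value(grouping_args, current_set):
    """Model of GROUPING(...): one bit per argument, leftmost argument is the
    highest bit; a bit is 1 when that dimension is NOT in the current set."""
    value = 0
    for dim in grouping_args:
        value = (value << 1) | (0 if dim in current_set else 1)
    return value


# Hypothetical query using the function with GROUPING SETS.
sql = """
SELECT channel, page, GROUPING(channel, page) AS g, COUNT(*) AS cnt
FROM wikipedia
GROUP BY GROUPING SETS ((channel, page), (channel), (page), ())
"""

# Print the modeled value of g for each grouping set in the query above.
for grouping_set in [("channel", "page"), ("channel",), ("page",), ()]:
    print(grouping_set, grouping_value(["channel", "page"], grouping_set))
```

With this reading, a result of 0 marks a fully detailed row and the highest value marks the grand total, which makes subtotal rows easy to filter.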

#10518

# Improved missing argument handling in expressions and functions

Expression processing can now be vectorized when inputs are missing, for example when a column does not exist. When an argument is missing in an expression, Druid can now infer the proper result type based on the non-null arguments. For instance, for longColumn + nonExistentColumn, nonExistentColumn is treated as (long) 0 instead of (double) 0.0. Finally, in default null handling mode, math functions can now produce proper output by treating missing arguments as zeros.
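A toy model of this inference rule, written in Python rather than Druid's expression language: a missing operand is represented by None, takes the type of the other operand, and contributes zero. The add helper is hypothetical and only mirrors the behavior described for default null handling mode.

```python
def add(left, right):
    """Toy model: a missing operand (None) takes the type of the other
    operand and contributes zero."""
    if left is None and right is None:
        return None  # nothing to infer a type from in this toy model
    present = left if left is not None else right
    zero = type(present)(0)  # int -> 0, float -> 0.0
    return (zero if left is None else left) + (zero if right is None else right)


long_column = 7
non_existent_column = None  # stands in for a column that does not exist

result = add(long_column, non_existent_column)
print(result, type(result).__name__)
```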

#10499

# Allow zero period for TIMESTAMPADD

The TIMESTAMPADD function now allows a zero period. This functionality is required for some BI tools such as Tableau.

#10550

# Ingestion

# Native parallel ingestion no longer requires explicit intervals

Parallel task no longer requires you to set explicit intervals in granularitySpec. If intervals are missing, the parallel task executes an extra step for input sampling which collects the intervals to index.
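For illustration, a granularitySpec without intervals might look like the following, built here as a Python dictionary and printed as JSON. The granularity values are arbitrary examples.

```python
import json

# Hypothetical granularitySpec for a parallel task; note there is no "intervals" key.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR",
    "rollup": True,
}

print(json.dumps(granularity_spec, indent=2))
```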

#10592
#10647

# Old Kafka version support

Druid now supports Apache Kafka versions older than 0.11. To read from an old version of Kafka, set isolation.level to read_uncommitted in consumerProperties. Only version 0.10.2.1 has been tested as of this release. See Kafka supervisor configurations for details.
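A minimal sketch of the relevant consumerProperties fragment, built as a Python dictionary and printed as JSON. The broker address is a placeholder.

```python
import json

# consumerProperties fragment for a Kafka supervisor spec.
consumer_properties = {
    "bootstrap.servers": "kafka-broker:9092",  # placeholder address
    "isolation.level": "read_uncommitted",     # needed for Kafka older than 0.11
}

print(json.dumps({"consumerProperties": consumer_properties}, indent=2))
```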

#10551

# Multi-phase segment merge for native batch ingestion

A new tuningConfig, maxColumnsToMerge, controls how many segments can be merged at the same time in the task. This configuration can be useful to avoid high memory pressure during the merge. See tuningConfig for native batch ingestion for more details.
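One way to picture multi-phase merging is sketched below. It assumes that the limit applies to the total number of columns across the segments merged in one phase and that at least two segments are always merged together. Both points are assumptions drawn from the option's name and purpose, and plan_merge_phase is a hypothetical helper, not Druid's implementation.

```python
def plan_merge_phase(column_counts, max_columns_to_merge):
    """Group segments (given by their column counts) into batches whose
    combined column count stays within the limit. A batch is only closed
    once it holds at least two segments, so merging always makes progress."""
    batches, current, total = [], [], 0
    for cols in column_counts:
        if len(current) >= 2 and total + cols > max_columns_to_merge:
            batches.append(current)
            current, total = [], 0
        current.append(cols)
        total += cols
    if current:
        batches.append(current)
    return batches


# Five segments with 40 columns each and a limit of 100 columns per merge.
print(plan_merge_phase([40, 40, 40, 40, 40], max_columns_to_merge=100))
```

Each batch would be merged separately, and the merged outputs would then be merged again in a later phase, which keeps the number of columns held in memory at once bounded.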

#10689

# Native re-ingestion is less memory intensive

Parallel tasks now sort segments by ID before assigning them to subtasks. This sorting minimizes the number of time chunks for each subtask to handle. As a result, each subtask is expected to use less memory, especially when a single Parallel task is issued to re-ingest segments covering a long time period.
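The effect of sorting can be sketched with a small model. The segment ID layout used here, datasource, start, end, and partition number joined by underscores, is a simplified, hypothetical format, and assign is an illustrative helper rather than Druid's assignment logic.

```python
def assign(segment_ids, num_subtasks):
    """Split segment IDs into contiguous groups, one per subtask."""
    size = -(-len(segment_ids) // num_subtasks)  # ceiling division
    return [segment_ids[i:i + size] for i in range(0, len(segment_ids), size)]


def time_chunks(group):
    """Distinct (start, end) time chunks in a group of hypothetical IDs
    shaped like '<datasource>_<start>_<end>_<partition>'."""
    return {tuple(seg.split("_")[1:3]) for seg in group}


# Three daily time chunks with two partitions each, arriving unordered.
segments = [
    "wiki_2021-01-01_2021-01-02_0", "wiki_2021-01-02_2021-01-03_0",
    "wiki_2021-01-03_2021-01-04_0", "wiki_2021-01-01_2021-01-02_1",
    "wiki_2021-01-02_2021-01-03_1", "wiki_2021-01-03_2021-01-04_1",
]

# Largest number of time chunks any single subtask has to handle.
unsorted_max = max(len(time_chunks(g)) for g in assign(segments, 3))
sorted_max = max(len(time_chunks(g)) for g in assign(sorted(segments), 3))
print(unsorted_max, sorted_max)
```

In this model, sorting places segments of the same time chunk next to each other, so each subtask touches fewer time chunks.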

#10646

# Web console

# Updated and improved web console styles

The new web console styles make better use of the Druid brand colors and standardize paddings and margins throughout. The icon and background colors are now derived from the Druid logo.

#10515

# Partitioning information is available in the web console

The web console now shows datasource partitioning information on the new Segment granularity and Partitioning columns.

Segment granularity column in the Datasources tab

Partitioning column in the Segments tab

#10533

# The column order in the Schema table matches the dimensionsSpec

The Schema table now reflects the dimension ordering in the dimensionsSpec.

#10588

# Metrics

# Coordinator duty runtime metrics

The Coordinator performs several 'duty' tasks, such as segment balancing and loading new segments. Two new metrics help you analyze how fast the Coordinator executes these duties.

  • coordinator/time: the time for an individual duty to execute
  • coordinator/global/time: the time for the whole duties runnable to execute

#10603

# Query timeout metric

A new metric provides the number of timed out queries. Previously timed out queries were treated as interrupted and included in the query/interrupted/count (see Changed HTTP status codes for query errors for more details).

  • query/timeout/count: the number of timed out queries during the emission period

#10567

# Shuffle metrics for batch ingestion

Two new metrics provide shuffle statistics for MiddleManagers and Indexers. These metrics have the supervisorTaskId as their dimension.

  • ingest/shuffle/bytes: number of bytes shuffled per emission period
  • ingest/shuffle/requests: number of shuffle requests per emission period

To enable the shuffle metrics, add org.apache.druid.indexing.worker.shuffle.ShuffleMonitor in druid.monitoring.monitors. See Shuffle metrics for more details.

#10359

# New clock-drift safe metrics monitor scheduler

The default metrics monitor scheduler is implemented based on ScheduledThreadPoolExecutor which is prone to unbounded clock drift. A new monitor scheduler, ClockDriftSafeMonitorScheduler, overcomes this limitation. To use the new scheduler, set druid.monitoring.schedulerClassName to org.apache.druid.java.util.metrics.ClockDriftSafeMonitorScheduler in the runtime.properties file.

#10448
#10732

# Others

# New extension for a password p...

druid-0.20.2

29 Mar 19:00

Apache Druid 0.20.2 introduces new configurations to address CVE-2021-26919: Authenticated users can execute arbitrary code from malicious MySQL database systems. We recommend enabling the new configurations below to mitigate vulnerable JDBC connection properties. These configurations apply to all JDBC connections for ingestion and lookups, but not to the metadata store. See security configurations for more details.

  • druid.access.jdbc.enforceAllowedProperties: When true, Druid applies druid.access.jdbc.allowedProperties to JDBC connections starting with jdbc:postgresql: or jdbc:mysql:. When false, Druid allows any kind of JDBC connection without JDBC property validation. This config defaults to false so that rolling upgrades do not break. It is already deprecated and may be removed in a future release, in which case the allow list will always be enforced.
  • druid.access.jdbc.allowedProperties: Defines a list of allowed JDBC properties. Druid always enforces the list for all JDBC connections starting with jdbc:postgresql: or jdbc:mysql: if druid.access.jdbc.enforceAllowedProperties is set to true. This option is tested against MySQL connector 5.1.48 and PostgreSQL connector 42.2.14. Other connector versions might not work.
  • druid.access.jdbc.allowUnknownJdbcUrlFormat: When false, Druid only accepts JDBC connections starting with jdbc:postgresql: or jdbc:mysql:. When true, Druid allows JDBC connections to any kind of database, but only enforces druid.access.jdbc.allowedProperties for PostgreSQL and MySQL.
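The interaction of the three settings can be sketched as a simplified Python model. It assumes that no checks run at all when enforceAllowedProperties is false, which is one reading of the descriptions above. The function name, the example allow list, and the example URLs are hypothetical, and this is not Druid's implementation.

```python
def check_jdbc_connection(url, properties, *, enforce_allowed_properties,
                          allowed_properties, allow_unknown_jdbc_url_format):
    """Return True if the connection would be accepted under the rules
    described above (simplified model)."""
    if not enforce_allowed_properties:
        return True  # assumption: no validation at all in this mode
    if url.startswith(("jdbc:postgresql:", "jdbc:mysql:")):
        # Known formats: every connection property must be on the allow list.
        return all(name in allowed_properties for name in properties)
    # Unknown formats are accepted only when explicitly allowed, and the
    # allow list is not applied to them.
    return allow_unknown_jdbc_url_format


allowed = {"useSSL", "requireSSL"}  # hypothetical allow list

print(check_jdbc_connection(
    "jdbc:mysql://db.example.com/lookups", {"useSSL": "true"},
    enforce_allowed_properties=True, allowed_properties=allowed,
    allow_unknown_jdbc_url_format=False))
print(check_jdbc_connection(
    "jdbc:mysql://db.example.com/lookups", {"autoDeserialize": "true"},
    enforce_allowed_properties=True, allowed_properties=allowed,
    allow_unknown_jdbc_url_format=False))
```

In this model the first call is accepted and the second is rejected, because autoDeserialize is not on the example allow list.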