
CSV uploads for ClickHouse Cloud #236

Merged: 11 commits, May 17, 2024
Conversation

@calherries (Contributor) commented May 10, 2024

Summary

Issue tracking CSV uploads: #230

This PR adds support for the uploads driver feature for ClickHouse. The feature is currently only supported for ClickHouse Cloud. It requires Metabase x.49.12 (updated from x.49.11) to work correctly.

Limitations:

  • OffsetDateTime values in the CSV, e.g. "2022-01-01T00:00:00+01", are inserted as strings, while in other officially supported drivers they would be inserted as a type with a time zone. This is blocked by #200 (Add fully fledged time zones support)
  • There is no automatically generated primary key column. This is because ClickHouse doesn't support an auto-incrementing type like SERIAL in PostgreSQL.

Checklist

Delete items not relevant to your PR:

  • Unit and integration tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG
  • For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

@CLAassistant commented May 10, 2024

CLA assistant check
All committers have signed the CLA.

@calherries marked this pull request as draft May 10, 2024 19:30
(let [sql (create-table!-sql driver table-name column-definitions :primary-key primary-key)]
(qp.writeback/execute-write-sql! db-id sql)))

(defmethod driver/insert-into! :clickhouse
calherries (Contributor, Author):

Rather than use the default sql-jdbc implementation, I've chosen to use the approach recommended here to allow for large uploads.
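
For context, a batched JDBC insert has roughly the following shape; this is a hypothetical sketch (the table, columns, and db-spec are placeholders), not necessarily the exact approach linked above:

(require '[next.jdbc :as jdbc])

(defn insert-rows!
  "Insert rows in a single JDBC batch. `db-spec`, the table name, and the
  column list are placeholders for illustration."
  [db-spec rows]
  (with-open [conn (jdbc/get-connection db-spec)
              ps   (.prepareStatement conn "INSERT INTO my_table (a, b) VALUES (?, ?)")]
    (doseq [[a b] rows]
      (.setObject ps 1 a)
      (.setObject ps 2 b)
      (.addBatch ps))
    (.executeBatch ps)))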

@crisptrutski left a comment:

LGTM, for what it's worth

(doseq [row values]
(when (seq row)
(doseq [[idx v] (map-indexed (fn [x y] [(inc x) y]) row)]
(condp isa? (type v)

crisptrutski:

Probably not a bottleneck, but making a protocol like (ps-set! v ps idx) could remove the reflective type check and the O(n) branching of condp.

calherries (Contributor, Author):

Yes, how do you imagine this being implemented exactly? The tricky part is this multimethod doesn't know the type of each column without using reflection, and values can be nil. I'm imagining this:

  • for each value, call (ps-set! v ps idx)
  • ps-set! looks up the method to use for the idx in a volatile map, e.g. setString etc. If the map contains a method for the idx, it uses it; otherwise it calculates the method using a similar condp expression, saves it to the map under the key idx, and executes it.

That way we'll only use reflection once for every column.
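
A minimal sketch of that per-column caching idea (all names here are hypothetical, not from the PR):

(defn make-ps-set!
  "Returns a setter fn that caches, per column index, which
  PreparedStatement method to use, so the condp dispatch runs once
  per column rather than once per value. Caveat: if a column's first
  value is nil, the generic setter gets cached for that column."
  []
  (let [methods (volatile! {})]
    (fn [^java.sql.PreparedStatement ps idx v]
      (let [set-fn (or (get @methods idx)
                       (let [f (condp instance? v
                                 String (fn [^java.sql.PreparedStatement ps i v]
                                          (.setString ps (int i) ^String v))
                                 Long   (fn [^java.sql.PreparedStatement ps i v]
                                          (.setLong ps (int i) (long v)))
                                 ;; fallback; also covers nil values
                                 (fn [^java.sql.PreparedStatement ps i v]
                                   (.setObject ps (int i) v)))]
                         (vswap! methods assoc idx f)
                         f))]
        (set-fn ps idx v)))))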

crisptrutski:

Maybe I'm missing something, but I was thinking we'd just implement the protocol directly on each of these java types for v, eschewing any reflection or map look-up. I think you can just use Object for the fallback case.
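
A minimal sketch of that direct extension (hypothetical names; this is not the code that was merged):

(defprotocol PSSet
  (ps-set! [v ps idx] "Bind v as parameter idx of PreparedStatement ps."))

(extend-protocol PSSet
  String
  (ps-set! [v ^java.sql.PreparedStatement ps idx]
    (.setString ps (int idx) v))
  Long
  (ps-set! [v ^java.sql.PreparedStatement ps idx]
    (.setLong ps (int idx) (long v)))
  ;; nil can be extended directly, which handles missing values
  nil
  (ps-set! [_ ^java.sql.PreparedStatement ps idx]
    (.setObject ps (int idx) nil))
  ;; Object is the fallback for any type without a specific implementation
  Object
  (ps-set! [v ^java.sql.PreparedStatement ps idx]
    (.setObject ps (int idx) v)))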

calherries (Contributor, Author):

Right. Anyway, this seems pretty inconsequential compared to all the slow stuff we're doing on the Metabase side. I think I'd prefer to keep it simple and not change this condp for now.

(Resolved review thread on src/metabase/driver/clickhouse.clj)
@calherries commented May 15, 2024

@slvrtrn I have a few questions:
(1) Is there a way to run tests with ClickHouse Cloud deployments?
(2) Is there an alternative way to determine the number of replicas without using system.macros, as you suggested here? I can create a ClickHouse Cloud user without sufficient privileges to select from the system.macros table.

Executing this query

SELECT count(*) AS nodes_count FROM system.clusters c JOIN system.macros m ON c.cluster = m.substitution

gives me this error

> Not enough privileges. To execute this query, it's necessary to have the grant SELECT(substitution) ON system.macros
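
For reference, on an on-premise deployment the missing privilege named in that error could be granted like this (the user name is a placeholder):

GRANT SELECT(substitution) ON system.macros TO metabase_user;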

(3) If there's no solution to (2), would you be willing to accept this contribution if it only worked for ClickHouse Cloud in the meantime? That's the deployment type we care about most. With this commit I've updated the PR so we only support ClickHouse Cloud.

@slvrtrn (Collaborator) commented May 15, 2024

(1) You can provide the hostname and port via the env variables. IIRC, it won't work correctly unless you add wait_end_of_query=1 to the JDBC connection string, because otherwise it will not wait until the tables are created on all replicas, and the driver might immediately query a replica that doesn't have them yet.
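
For illustration, a connection string with that setting appended might look like this (hostname, port, and database are placeholders):

jdbc:clickhouse://my-service.clickhouse.cloud:8443/default?ssl=true&wait_end_of_query=1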

(2) I am only aware of the mentioned method.

> I can create a ClickHouse Cloud user without sufficient privileges to select from the system.macros table.

You don't need this in CH Cloud. It's necessary only for on-premise. But yes, these grants are one of the main pain points.

(3) It will work for single-node and ClickHouse Cloud, then. But we need a way to disable it for on-premise clusters if they're detected. Can we do that dynamically? The feature is enabled or disabled at the driver level, not at the connection level.

@calherries commented May 15, 2024

(1) Thanks, I can run the tests locally against ClickHouse Cloud. I should have been more specific with my question: is there a way to run the GitHub Actions test suite against ClickHouse Cloud deployments?

(2) OK, thanks for clarifying.

(3) How can we know whether it's an on-premise cluster vs. an on-premise single node without getting the nodes count? You only suggested an approach using nodes_count in your comment here:

> is_cloud = 0, nodes_count = 1 -> on-premise single node
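
In code terms, the classification being discussed is roughly this (a sketch; the function and argument names are hypothetical):

(defn deployment-type
  "Classify the deployment from the two signals discussed above."
  [is-cloud nodes-count]
  (cond
    (pos? is-cloud)   :clickhouse-cloud
    (= 1 nodes-count) :on-premise-single-node
    :else             :on-premise-cluster))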

@calherries commented

@slvrtrn tagging you because you might have missed my last comment above, where I didn't tag you

@slvrtrn (Collaborator) commented May 15, 2024

@calherries
(1) Currently, no. We need to add this; I will do that as a follow-up (after both this and #232 are merged).
(3) This is the best option I am aware of. I will clarify whether there is anything else available, but chances are that's it. The grants issue is annoying, but it's necessary only for on-premise deployments (not CH Cloud, as you don't need this query there); we can add a doc entry about the required grants (after we add on-prem cluster support in a follow-up PR).

I have another question, though. Is there a way to disable a particular feature for a specific connection? These are defined in the driver/database-supports? override, but a cleaner solution would be to enable it based on the deployment type (i.e. after we have the connection details). This is useful in this PR at the moment and also in #232, because impersonation (another feature defined via an override) should be enabled only on the most recent CH versions.

@calherries marked this pull request as ready for review May 16, 2024 02:00
@calherries marked this pull request as draft May 16, 2024 02:00
@calherries commented May 16, 2024

(3) Okay, for now I'll support CH Cloud only.

> Is there a way to disable a particular feature for a specific connection?

Yes, we generally use driver/database-supports? for this; you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.
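
A minimal sketch of that gating, assuming a cloud? helper that inspects the stored connection details and the usual metabase.driver alias (the actual commit may differ in naming and detection logic):

(defmethod driver/database-supports? [:clickhouse :uploads]
  [_driver _feature db]
  ;; `cloud?` is assumed to detect a ClickHouse Cloud deployment from
  ;; the stored connection details; only enable uploads there.
  (cloud? (:details db)))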

I think this PR is ready to be reviewed now, but we'll need to wait for the changes in this PR to be released in the 49 branch before there is a released Metabase version that is compatible with this.

@calherries marked this pull request as ready for review May 16, 2024 04:00
@calherries commented

For this checklist item:

> A human-readable description of the changes was provided to include in CHANGELOG

What version should these changes go under in the changelog?

@slvrtrn (Collaborator) commented May 16, 2024

> What version should these changes go under in the changelog?

I will add this, don't worry. Otherwise, there will be conflicts with #232.

> you can have the result depend on the version or the connection details. See this example with MongoDB. I've also done a similar thing for this PR in the last commit 6b8935b.

Thanks, this is great.

:quoted true
:dialect (sql.qp/quote-style driver)))
"ENGINE = MergeTree"
(format "PRIMARY KEY (%s)" (str/join ", " (map quote-name primary-key)))
slvrtrn (Collaborator):

You can drop this and put the primary-key in the ORDER BY clause instead.

i.e. the table definition should be just:

CREATE TABLE foo
(col1 Type, ...)
ENGINE = MergeTree
ORDER BY (primary-key)

slvrtrn (Collaborator) commented May 16, 2024:

See also: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree#order_by

> ClickHouse uses the sorting key as a primary key if the primary key is not defined explicitly by the PRIMARY KEY clause.

calherries (Contributor, Author):

Nice, I've made that change: Use order by to specify primary key.

@slvrtrn (Collaborator) commented May 16, 2024

By the way:

> Resolves #230

It doesn't; IIUC, the user that requested this feature uses on-prem.

@calherries changed the title from "CSV uploads" to "CSV uploads for ClickHouse Cloud" May 16, 2024
@slvrtrn (Collaborator) commented May 16, 2024

@calherries, some tests are failing on the CI.

There is one more thing. In my issue comment, I mentioned:

> To immediately get the data from any replica regardless of its sync status, it's also better to add select_sequential_consistency=1. This will guarantee that immediately after the insertion, we can query any node in the cluster and get the data back.

This is an absolute necessity if we add the Cloud test run, because we need to immediately get the test data back after the table is created; otherwise, we will experience weird flakiness during the runs. This can also happen in real-world usage (albeit more rarely).

@slvrtrn mentioned this pull request May 16, 2024
@calherries commented May 16, 2024

@slvrtrn

> select_sequential_consistency=1

I struggled with this, because we're using a connection pool and don't have any way to set settings in a SQL query. I've made an attempt here: set select_sequential_consistency in connection details. But I'm concerned about the performance implications of having this set for every query, even if it's only needed for queries on uploaded tables with CH Cloud.

Update: I've implemented a workaround in Avoid setting select_sequential_consistency for on-premise.

(defmethod sql-jdbc.conn/connection-details->spec :clickhouse
  [_ details]
  (cond-> (connection-details->spec* details)
    (cloud? details)
    ;; the diff is truncated here; presumably the Cloud branch adds the
    ;; select_sequential_consistency setting to the connection spec
    (assoc :select_sequential_consistency true)))

crisptrutski:

From what I understand, select_sequential_consistency just limits which replica you query from, and doesn't affect the performance of the query on that specific replica.

This logic seems like it will break consistency for multi-replica on-prem installations, and needlessly disable it on single-replica ones. It seems best to just enable it unconditionally for now?

For any multi-replica installation, enabling it for all connections seems to wipe out most of the benefits of connecting to a distributed installation, but I don't see any way around this besides reworking Metabase to support separate pools for different connection flavors.

calherries (Contributor, Author):

I think you're mistaking the behaviour of select_sequential_consistency on CH Cloud (docs here) for its behaviour on-premise (the link you shared). We're only enabling select_sequential_consistency on CH Cloud, where it's much less of an issue.

There's more on SharedMergeTree in this blog post:

> Unlike ReplicatedMergeTree, SharedMergeTree doesn't require replicas to communicate with each other. Instead, all communication happens through shared storage and clickhouse-keeper. SharedMergeTree implements asynchronous leaderless replication and uses clickhouse-keeper for coordination and metadata storage. This means that metadata doesn't need to be replicated as your service scales up and down. This leads to faster replication, mutation, merges and scale-up operations. SharedMergeTree allows for hundreds of replicas for each table, making it possible to dynamically scale without shards. A distributed query execution approach is used in ClickHouse Cloud to utilize more compute resources for a query.

> If your use case requires consistency guarantees that each server is delivering the same query result, then you can run the SYNC REPLICA system statement, which is a much more lightweight operation with the SharedMergeTree. Instead of syncing data (or metadata with zero-copy replication) between servers, each server just needs to fetch the current version of metadata from Keeper.

And I can't imagine fetching metadata from Keeper being a bottleneck. I also got this DM from Serge on select_sequential_consistency:

> With CH Cloud, there should not be any issues with that, as it uses "shared" merge tree and this is merely a metadata sync before executing the query (this is super fast, as it is a fetch of the meta from zookeeper).

slvrtrn (Collaborator):

Yes, select_sequential_consistency is correctly enabled for the Cloud type of instance only.

For on-premise clusters (in a follow-up PR), we should just set insert_quorum equal to the nodes count, so that a separate replica sync won't be necessary (it will effectively happen during the CSV insert) and other read-only operations won't be affected.
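
For illustration, an insert with a quorum setting might look like this (the table, values, and quorum count are placeholders):

INSERT INTO uploaded_csv_table SETTINGS insert_quorum = 3 VALUES (1, 'a');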

@slvrtrn (Collaborator) commented May 17, 2024

Let's bump the MB version to 0.49.11 in .github/workflows/check.yml#L20 and merge if the CI is green.
