Uploads CSV to Clickhouse #230

Open
1 of 3 tasks
tritatuy opened this issue Mar 20, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@tritatuy

tritatuy commented Mar 20, 2024

(Edited by @slvrtrn to keep track of the progress of the feature implementation).

The CSV uploads feature needs to support the following ClickHouse deployment types:

  • [x] ClickHouse Cloud (supported since driver version 1.5.0)
  • [ ] on-premise single node
  • [ ] on-premise cluster

tritatuy added the enhancement (New feature or request) label Mar 20, 2024
@slvrtrn
Collaborator

slvrtrn commented Mar 20, 2024

Have you considered uploading CSV with specialized tools such as DBeaver, which will provide much more flexibility? Then, just re-sync the schema in Metabase.

Additionally, it is preferable to have read-only user profiles for the BI tools. Uploading a CSV will require Metabase to be connected via a non-read-only profile.

@tritatuy
Author

tritatuy commented Apr 3, 2024

> DBeaver

Thanks for your answer!
We would like to have this feature for business users as well, because there are many cases where they need to upload some CSV data and work with it combined with data from the DWH, and DBeaver or another external tool is not convenient for them.

A non-read-only profile is also not a problem, because we can give Metabase access only to our sandbox database on the ClickHouse host.
So my request is to extend this feature to ClickHouse as well, not only MySQL and PostgreSQL.

@slvrtrn
Collaborator

slvrtrn commented Apr 3, 2024

This is a bit non-trivial due to these caveats:

  • We will need to introspect the ClickHouse deployment type (cluster/non-cluster); on-premise cluster DDLs are different.
  • What should the primary key be for an arbitrary CSV file - just the line number?
  • What table engine should we use? On a cluster, it needs to be ReplicatedMergeTree, i.e., we also need to introspect the macro values to substitute into the DDL.

To be fair, I am very hesitant to implement this, given that other well-known tools do this with specialized interfaces that will yield better results.

@tritatuy
Author

tritatuy commented Apr 5, 2024

Thanks for your feedback. My suggestions are:

  1. Allow this feature only for cluster deployments.
  2. To be honest, I have no idea, but maybe the line number is good enough. Actually, Metabase doesn't ask for a primary key even when uploading CSV to MySQL or PostgreSQL databases, so a solution must exist.
  3. Create two tables: the first one's engine is ReplicatedMergeTree, and the second one is a Distributed table pointing at the first. That Distributed table is the one that Metabase's resulting model will query.

Something like this:

CREATE TABLE IF NOT EXISTS sandbox.file_name ON CLUSTER clickhouse_cluster
(
    col_name_1 type,
    ...,
    col_name_n type
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sandbox.file_name', '{replica}')
--...

CREATE TABLE IF NOT EXISTS sandbox.file_name_dist ON CLUSTER clickhouse_cluster
(
    col_name_1 type,
    ...,
    col_name_n type
)
ENGINE = Distributed(clickhouse_cluster, sandbox, file_name)
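
A hedged usage sketch of the proposed pair, assuming a single hypothetical String column col_name_1: both writes and reads would go through the Distributed table, which routes rows to the underlying ReplicatedMergeTree shards.

-- hypothetical usage: Metabase writes to and reads from the Distributed table
INSERT INTO sandbox.file_name_dist (col_name_1) VALUES ('example row');
SELECT count(*) FROM sandbox.file_name_dist;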

Of course, we as analysts can do this in DataGrip, PyCharm, and so on. But the key feature of Metabase is self-service, and we want our business users to be able to upload their own data without calling developers and analysts, because they are always a bottleneck.

@mshustov
Member

mshustov commented Apr 9, 2024

According to the docs, only two connectors currently support the upload functionality - PostgreSQL and MySQL.
@tritatuy, we might consider implementing this feature request, provided there is demand from the user base. Let's collect some feedback on the current issue.
If you would like to contribute to the project, @slvrtrn can give you some pointers and hints for implementing the functionality.

@tritatuy
Author

> According to the docs, only two connectors currently support the upload functionality - PostgreSQL and MySQL. @tritatuy, we might consider implementing this feature request, provided there is demand from the user base. Let's collect some feedback on the current issue.

Ok, thanks! Let's wait for new requests.

@slvrtrn
Collaborator

slvrtrn commented Apr 17, 2024

I'd like to add a bit more detail here.

We have three different deployment types that we want to support:

  • on-premise single node (or just a local Docker container, as the CH driver uses)
  • on-premise cluster (also possible with local Docker; there is a docker-compose file in the driver repo, too)
  • ClickHouse Cloud.

The DDL for the created tables and the specific settings required for data insertion vary significantly across these deployment types.

Here's what the driver needs to do to properly create a table and insert data into it while supporting all three scenarios:

Prerequisites

  1. Introspect whether we have more than one node:
SELECT count(*) AS nodes_count FROM system.clusters c JOIN system.macros m ON c.cluster = m.substitution
  2. Introspect whether this is ClickHouse Cloud (the query returns 1 when it is ClickHouse Cloud):
SELECT value AS is_cloud FROM system.settings WHERE name='cloud_mode'
  3. Now it is enough to decide which path we are taking when dealing with the inserts (a combined introspection sketch follows this list):
  • is_cloud = 0, nodes_count = 1 -> on-premise single node (with a fair amount of certainty)
  • is_cloud = 0, nodes_count > 1 -> on-premise cluster (100%)
  • is_cloud = 1, ignore nodes_count -> CH Cloud
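
For illustration, both prerequisite checks could be folded into a single query. A minimal sketch, assuming a ClickHouse version recent enough to expose the cloud_mode setting (the scalar subquery would throw on versions where it does not exist):

SELECT multiIf(
    -- cloud_mode is '1' only on ClickHouse Cloud
    (SELECT value FROM system.settings WHERE name = 'cloud_mode') = '1', 'cloud',
    -- more than one node whose cluster matches a macro substitution implies an on-premise cluster
    (SELECT count(*)
     FROM system.clusters c
     JOIN system.macros m ON c.cluster = m.substitution) > 1, 'cluster',
    'single_node'
) AS deployment_type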

Connection/DDL/insert variants

On-premise single node

DDL:

CREATE TABLE csv_upload (...) ENGINE MergeTree ORDER BY (...)

Insert: standard JDBC methods from MB will probably work.
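
To make the DDL concrete, a minimal sketch with hypothetical column names; since an arbitrary CSV has no natural primary key, a surrogate row number (or an empty sorting key via ORDER BY tuple()) could serve as the ordering key:

CREATE TABLE csv_upload
(
    row_number UInt64, -- hypothetical surrogate key: the CSV line number
    name       String,
    value      Float64
)
ENGINE = MergeTree
ORDER BY row_number
-- or: ORDER BY tuple() if no ordering key is wanted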

On-premise cluster

This is the most complicated scenario.

  1. Introspect the macro values. We need all three:
SELECT getMacro('cluster'), getMacro('replica'), getMacro('shard')
  2. Obtain the JDBC connection with two extra parameters: wait_end_of_query=1, so that we wait until the table is created on every node of the cluster, and insert_quorum=nodes_count, which we got from one of the prerequisite queries (a per-statement sketch follows this list).
  3. The DDL is entirely different from the single-node setup:
CREATE TABLE csv_upload ON CLUSTER '{cluster}'
(...)
ENGINE ReplicatedMergeTree(
  '/clickhouse/{cluster}/tables/{database}/{table}/{shard}',
  '{replica}'
)
ORDER BY (...)
  4. Insert as usual using the default JDBC methods from the MB sources. insert_quorum will ensure the data ends up on every node without relying on eventual consistency.
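
As a sketch of an alternative to connection-level parameters, insert_quorum can also be attached per statement, assuming hypothetical columns, a 3-node cluster, and a ClickHouse version that accepts a SETTINGS clause on INSERT; wait_end_of_query stays on the connection, as it is an HTTP-interface parameter:

-- insert_quorum should match the nodes_count obtained in the prerequisites
INSERT INTO csv_upload (row_number, name, value)
SETTINGS insert_quorum = 3
VALUES (1, 'example', 1.0)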

CH Cloud

This is mostly similar to the on-premise single node setup (CH Cloud takes care of the cluster macro, etc., in the DDL).

  1. Obtain the JDBC connection with at least one extra parameter, wait_end_of_query=1, so that we wait until the table is created on every node of the cluster. We do not rely on quorum in the Cloud. However, to immediately get the data back from any replica regardless of its sync status, it is better to also add select_sequential_consistency=1 (see the sketch after this list). This guarantees that immediately after the insertion, we can query any node in the cluster and get the data back.
  2. The DDL is similar to the single-node deployment:
CREATE TABLE csv_upload (...) ENGINE MergeTree ORDER BY (...)
  3. Insert as usual using the default JDBC methods from the MB sources.
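
To illustrate the effect of select_sequential_consistency, a hedged sketch of a read issued right after the upload; with the setting enabled, the queried replica waits for the preceding quorum inserts before answering, so the fresh rows are visible:

-- any replica returns the freshly inserted rows immediately after the upload
SELECT count(*) FROM csv_upload
SETTINGS select_sequential_consistency = 1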

I don't have time to implement all these scenarios, and I don't feel comfortable implementing support for only one deployment type.

However, if someone from the community wishes to take on this task, I can help with the setup, statements, etc. - you could always ping me here or directly in the community Slack.

@slvrtrn
Collaborator

slvrtrn commented May 28, 2024

1.5.0 supports CSV uploads with ClickHouse Cloud (props to @calherries).
I will add proper support for on-premise clusters after 0.50.x.
