feat(influx_tools): Add export to parquet files #25297

srebhan · 2024-09-09T11:56:09Z

Closes #
Supersedes #25253

Describe your proposed changes here.

I've read the contributing section of the project README.
Signed CLA (if not already signed).

This PR adds a command to export data into per-shard parquet files. To do so, the command iterates over the shards, creates a cumulative schema over the series of a measurement (i.e. a super-set of tags and fields) and exports the data to a parquet file per measurement and shard.
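The cumulative schema described above can be sketched roughly as follows. This is only an illustration of the "superset of tags and fields" idea, not the code in this PR: the `Series` and `Schema` types and the `MergeSchema` function are hypothetical names, and field types are simplified to plain strings.

```go
package main

import (
	"fmt"
	"sort"
)

// Series is a hypothetical, simplified view of one series of a measurement:
// its tag keys and a map from field key to field type name.
type Series struct {
	Tags   []string
	Fields map[string]string
}

// Schema is the cumulative (superset) schema of a measurement.
type Schema struct {
	Tags   []string          // sorted union of all tag keys
	Fields map[string]string // union of all field keys with their type
}

// MergeSchema builds the superset of tags and fields over all series.
// It returns an error if a field key appears with two different types.
func MergeSchema(series []Series) (Schema, error) {
	tagSet := make(map[string]struct{})
	fields := make(map[string]string)
	for _, s := range series {
		for _, t := range s.Tags {
			tagSet[t] = struct{}{}
		}
		for name, typ := range s.Fields {
			if prev, ok := fields[name]; ok && prev != typ {
				return Schema{}, fmt.Errorf("field %q has conflicting types %s and %s", name, prev, typ)
			}
			fields[name] = typ
		}
	}
	tags := make([]string, 0, len(tagSet))
	for t := range tagSet {
		tags = append(tags, t)
	}
	sort.Strings(tags)
	return Schema{Tags: tags, Fields: fields}, nil
}

func main() {
	schema, err := MergeSchema([]Series{
		{Tags: []string{"hostname", "region"}, Fields: map[string]string{"usage_user": "float"}},
		{Tags: []string{"hostname", "rack"}, Fields: map[string]string{"usage_idle": "float"}},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Rows of a series that lacks a tag or field of the merged schema
	// would carry a null in that column of the exported file.
	fmt.Println(schema.Tags, len(schema.Fields))
}
```

In this sketch a type conflict is reported as an error; how the actual command reacts to such conflicts is not shown here.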

To test the tool, run:

go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config influxdb.conf -database telegraf

.circleci/config.yml

cmd/influx_tools/main.go

cmd/influx_tools/parquet/batcher.go

cmd/influx_tools/parquet/command.go

cmd/influx_tools/parquet/exporter.go

davidby-influx

I did a quick review, but I'm not familiar with arrow and certainly missed some things. I can do a more thorough review if we paired to walk through the algorithm once.

cmd/influx_tools/parquet/schema.go

srebhan · 2024-09-18T13:49:05Z

@davidby-influx thanks for the thorough review! I tried to address all issues and commented on the three unresolved ones. Will schedule a meeting for walking through the code. Thanks again!

cmd/influx_tools/parquet/batcher.go

davidby-influx

LGTM

cmd/influx_tools/parquet/cursors.go

alespour · 2024-10-02T11:13:19Z

~~I'm not sure what to make of this: I have v1 db with several measurements like cpu disk etc, each with ~8M rows~~

> select count(usage_user) from cpu
name: cpu
time count
---- -----
0    8631360

The same query returns a different number of rows in the exported Parquet "db":

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'all/cpu-*.parquet'"
count(usage_user)
-----------------
28771200

Log attached.
cpu-export.log

alespour · 2024-10-02T14:06:08Z

tested measurement without tags - OK
tested single & all measurements export - OK, except for the discrepancy in the number of rows

Tested with a DB simulating one month of monitoring data from a small data center (9 measurements such as cpu and disk, 10 tags). DB size on disk: 4.1 GB in 5 shards.

The exported Parquet files take 11 GB on disk; the export took 1h6m on a somewhat obsolete laptop (Core i7 CPU, 8-core, 16 GB RAM, SSD). Memory usage during the export was stable (RSS peak ~2 GB).

InfluxDB measurement structure example:

> show tag keys from cpu
name: cpu
tagKey
------
arch
datacenter
hostname
os
rack
region
service
service_environment
service_version
team

> show field keys from cpu
name: cpu
fieldKey         fieldType
--------         ---------
usage_guest      float
usage_guest_nice float
usage_idle       float
usage_iowait     float
usage_irq        float
usage_nice       float
usage_softirq    float
usage_steal      float
usage_system     float
usage_user       float

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "describe select * from 'all/cpu-*.parquet'"

column_name          column_type  null  key  default  extra
-------------------  -----------  ----  ---  -------  -----
time                 TIMESTAMP    YES                      
arch                 VARCHAR      YES                      
datacenter           VARCHAR      YES                      
hostname             VARCHAR      YES                      
os                   VARCHAR      YES                      
rack                 VARCHAR      YES                      
region               VARCHAR      YES                      
service              VARCHAR      YES                      
service_environment  VARCHAR      YES                      
service_version      VARCHAR      YES                      
team                 VARCHAR      YES                      
usage_guest          DOUBLE       YES                      
usage_guest_nice     DOUBLE       YES                      
usage_idle           DOUBLE       YES                      
usage_iowait         DOUBLE       YES                      
usage_irq            DOUBLE       YES                      
usage_nice           DOUBLE       YES                      
usage_softirq        DOUBLE       YES                      
usage_steal          DOUBLE       YES                      
usage_system         DOUBLE       YES                      
usage_user           DOUBLE       YES

Measurement without tags:

alespour@master-node:/bigdata/x$ duckdb -column -s "select * from 'notags/*.parquet'"
time                        lat    lon  
--------------------------  -----  -----
2024-10-02 13:03:55.643371  49.95  14.47
2024-10-02 13:04:04.423014  49.91  14.49
2024-10-02 13:04:12.726653  49.94  14.53

alespour · 2024-10-02T14:19:02Z

I will repeat the test to verify the number of rows (mis)match.

alespour · 2024-10-02T18:09:54Z

My apologies, it was a mistake on my side. Row count matches.

InfluxDB:

> select count(usage_user) from cpu
name: cpu
time count
---- -----
0    28771200

Parquet:

alespour@master-node:/bigdata/x$ duckdb -column -s "select count(usage_user) from 'cpu/*.parquet'"
count(usage_user)
-----------------
28771200

alespour · 2024-10-03T08:08:49Z

tested other types - OK

Creating the following schemata for 1 measurement(s):
  Measurement "types" with 0 tag(s) and  5 field(s):
    Column	Kind		Datatype
    ------	----		--------
    time	timestamp	timestamp (nanosecond)
    label	field		string
    lat		field		float
    lon		field		float
    match	field		boolean
    scale	field		integer

alespour@master-node:/bigdata/x$ sudo duckdb -column -s "describe from 'types/*.parquet'"
column_name  column_type  null  key  default  extra
-----------  -----------  ----  ---  -------  -----
time         TIMESTAMP    YES                      
label        VARCHAR      YES                      
lat          DOUBLE       YES                      
lon          DOUBLE       YES                      
match        BOOLEAN      YES                      
scale        BIGINT       YES

alespour@master-node:/bigdata/x$ sudo duckdb -column -s "select * from 'types/*.parquet' limit 1"
time                        label  lat    lon    match  scale
--------------------------  -----  -----  -----  -----  -----
2024-10-03 07:58:33.419431  a1     49.94  14.53  true   4

alespour · 2024-10-03T08:13:15Z

It's GTG by me 👍

srebhan · 2024-11-18T13:49:29Z

To run the exporter in this PR do the following (assuming you are using a BASH-compatible shell)

Clone the repo and checkout the PR

# git clone https://github.com/influxdata/influxdb.git
# cd influxdb/
# git fetch origin pull/25297/head:v1-bulk-exporter-parquet 
# git checkout v1-bulk-exporter-parquet

Build InfluxDB v1

# export PKG_CONFIG=${PWD}/pkg-config.sh
# go build ./...

Run the exporter (with the help flag)

# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet --help

Run the exporter with the config of an existing server instance

# go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config <path to influxdb config dir>/influxdb.conf -database <database to export>

dburton-influxdata · 2024-11-18T22:12:13Z

Do we have a compiled version to test with or do I still need to clone the repo and build the go binary?

dburton-influxdata · 2024-11-18T22:18:45Z

I converted all of the BASH commands into a Python script and ran it. It generates an error during the build.
exporter_build_errors_python exporter_script.txt

dburton-influxdata · 2024-11-18T22:19:48Z

Here is the Python script in Zip format for Github.
exporter_script.zip

srebhan · 2024-11-21T21:41:04Z

@dburton-influxdata using os.environ does NOT export the variable to subprocesses like the go command! You would need to use os.putenv but I don't understand why you need to use python for the whole thing...

jwei-influx · 2024-12-19T17:39:12Z

I'll be taking this over from Darren. I do have a couple of questions regarding this tool:

Has there been any consideration about how the tool is intended to handle mixed field type shards?
If we need to do any sort of custom partitioning on the eventual 3.0 system, are we able to do that with this tool? Or conversely, is the resulting parquet file from this tool able to be slotted in behind a custom partitioning scheme that is pre-applied to the 3.0 instance?
Are we able to do any sort of manipulation of the tags and fields using this tool, or potentially by editing the resulting parquet files?
Are we able to use this for backloading processes? ie: slotting the resulting parquet files into an existing database that's receiving the real-time dual-written feed from the original 1.x system

I might have more questions as I test the tool, but these are the ones that are top of mind for me right now.

srebhan · 2024-12-20T10:10:02Z

Thanks for your investigations @jwei-influx! Let me answer your questions:

> Has there been any consideration about how the tool is intended to handle mixed field type shards?

Yes. There are the --resolve-types and --resolve-names command-line options to fix type conflicts and name conflicts (between tags and fields) respectively.
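To illustrate what a name conflict between tags and fields means, here is a minimal sketch that only detects such conflicts. It is not the code in this PR and says nothing about how `--resolve-names` renames columns; the `NameConflicts` function is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// NameConflicts returns the sorted names that are used both as a tag key
// and as a field key. Such names cannot both become a column of the same
// table without renaming one of them.
func NameConflicts(tags, fields []string) []string {
	tagSet := make(map[string]struct{}, len(tags))
	for _, t := range tags {
		tagSet[t] = struct{}{}
	}
	var conflicts []string
	for _, f := range fields {
		if _, ok := tagSet[f]; ok {
			conflicts = append(conflicts, f)
			delete(tagSet, f) // report each conflicting name only once
		}
	}
	sort.Strings(conflicts)
	return conflicts
}

func main() {
	tags := []string{"hostname", "region", "status"}
	fields := []string{"usage_user", "status"}
	fmt.Println(NameConflicts(tags, fields))
}
```

Here `status` is reported because it appears in both lists; the exact syntax of the resolution options is described in the command's help output.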

> If we need to do any sort of custom partitioning on the eventual 3.0 system, are we able to do that with this tool? Or conversely, is the resulting parquet file from this tool able to be slotted in behind a custom partitioning scheme that is pre-applied to the 3.0 instance?

This is beyond the scope of this tool. This tool outputs the data as-is (plus potential type and name changes as discussed above) without the ability to split, merge or modify data.

> Are we able to do any sort of manipulation of the tags and fields using this tool, or potentially by editing the resulting parquet files?

As I mentioned above, you can modify field names (and types) using the --resolve-types and --resolve-names command-line options. Beyond this, the tool is not intended for manipulating or editing data or schemata but for exporting of existing data!

> Are we able to use this for backloading processes? ie: slotting the resulting parquet files into an existing database that's receiving the real-time dual-written feed from the original 1.x system

This is a question you should ask the authors of the import tool that takes the parquet files generated by this tool! Maybe @jacobmarble can answer this question or knows who could answer this...

> I might have more questions as I test the tool, but these are the ones that are top of mind for me right now.

Happy to answer them. ;-)

jacobmarble · 2024-12-23T18:48:24Z

> > Are we able to use this for backloading processes? ie: slotting the resulting parquet files into an existing database that's receiving the real-time dual-written feed from the original 1.x system
>
> This is a question you should ask the authors of the import tool that takes the parquet files generated by this tool! Maybe @jacobmarble can answer this question or knows who could answer this...

I haven't looked closely at this PR, and I'm no longer managing the team that is working on Parquet import. @helenosheaa might direct the right person to answer thoughtfully and accurately.

barbaranelson · 2025-01-30T00:05:48Z

Adding a reference to the bulk import doc, in case it's useful:
https://docs.google.com/document/d/1-KUAKLAWfzHmEVRPKKF-jo7YL2gLEFT1MXa5tqyFm-4/edit?usp=sharing

helenosheaa · 2025-01-30T16:00:59Z

I missed the @ from jacob here.

So this is tooling to export from v1 to Parquet, and the question is whether its output will line up with the bulk_ingester, which takes Parquet files. There are docs on how the bulk_ingester currently functions, and on its restrictions, here.

@carols10cents or @stuartcarnie might be good additional reviewers if needed

carols10cents · 2025-01-31T14:57:37Z

> > > 7. Are we able to use this for backloading processes? ie: slotting the resulting parquet files into an existing database that's receiving the real-time dual-written feed from the original 1.x system
> >
> > This is a question you should ask the authors of the import tool that takes the parquet files generated by this tool! Maybe @jacobmarble can answer this question or knows who could answer this...
>
> I haven't looked closely at this PR, and I'm no longer managing the team that is working on Parquet import. @helenosheaa might direct the right person to answer thoughtfully and accurately.

The bulk_ingester tool is able to import data into tables that are also being written to directly; the compactor will reorganize and deduplicate as needed. The bulk_ingester also repartitions data according to the partition template set on the destination table. In other words, it rewrites the Parquet files during the import, which does take some time; the Parquet files are not added to the system directly as they are.

Please let me know if you have any other questions!

srebhan · 2025-02-26T10:58:39Z

@jwei-influx I rebased the PR and fixed the help. You can now check out this PR and then just run

# go build -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/

to build the influx_tools binary. Afterwards you can check the help using

# ./influx_tools export-parquet --help

and experiment with the tool like

# ./influx_tools export-parquet -config <path to influxdb config dir>/influxdb.conf -database <database to export>

Let me know if there are any issues!

srebhan force-pushed the v1-bulk-exporter-parquet branch 2 times, most recently from 6869ba3 to bd44db9 on September 9, 2024 14:12

srebhan force-pushed the v1-bulk-exporter-parquet branch from 7c930bb to 2bb73ce on September 17, 2024 19:39

davidby-influx reviewed Sep 18, 2024

cmd/influx_tools/parquet/exporter.go (outdated, resolved)

davidby-influx reviewed Sep 18, 2024

cmd/influx_tools/parquet/schema.go (outdated, resolved)

cmd/influx_tools/parquet/schema.go (outdated, resolved)

srebhan force-pushed the v1-bulk-exporter-parquet branch from 2bb73ce to 46aef0b on September 18, 2024 10:41

davidby-influx assigned srebhan Sep 18, 2024

davidby-influx reviewed Sep 18, 2024

cmd/influx_tools/parquet/batcher.go (outdated, resolved)

srebhan force-pushed the v1-bulk-exporter-parquet branch 2 times, most recently from d7216ca to a7d0f1b on September 19, 2024 20:03

davidby-influx previously approved these changes Sep 19, 2024

cmd/influx_tools/parquet/cursors.go (outdated, resolved)

srebhan mentioned this pull request Oct 2, 2024 in: feat: influx_tools export parquet #25253 (closed)

srebhan added 14 commits February 26, 2025 11:27

feat(influx_tools): Add export to parquet files

79d259c

chore: Wrap errors in influx_tools main

8549ccf

chore: Do not create unused series cursor and simplify batcher creation

90b08f8

chore: Move converter creation to batcher as it is only used there

1b4eb5f

fix: Caputure error when closing series cursor

5161051

feat: Print shard series-file path on error

e048d3e

chore: Replace panic by returning an error

3c375e5

feat: Use logger instead of raw printing

85b2a11

fix: Caputure error when closing exporter

9f17523

fix: Caputure more defer errors

15ab1a5

feat: Detect name conflicts after name resolution

7089359

fix: Make sure deferred functions are actually called

a231bd4

feat: Move out cursor handling

b87b349

feat: Preallocate maps and slices

f3b68fc

srebhan dismissed davidby-influx’s stale review via f3b68fc February 26, 2025 10:32

srebhan force-pushed the v1-bulk-exporter-parquet branch from a7d0f1b to f3b68fc on February 26, 2025 10:32

fix: Make help accessible for command

dc7da40

srebhan force-pushed the v1-bulk-exporter-parquet branch from e123f92 to dc7da40 on February 26, 2025 10:59
