feat: influx_tools export parquet #25253

alespour · 2024-08-20T08:56:38Z

Test run with:

go run -ldflags "-X google.golang.org/protobuf/reflect/protoregistry.conflictPolicy=ignore" ./cmd/influx_tools/ export-parquet -config /bigdata/influxdb-copy/influxdb.conf -no-conflict-path -database telegraf -measurement cpu

Closes #

Describe your proposed changes here.

I've read the contributing section of the project README.
Signed CLA (if not already signed).

stuartcarnie

I've added a suggestion to change how the schema is gathered and to change the processing so that it handles one shard at a time. If you make those changes, we can then revisit the next questions.

stuartcarnie · 2024-08-27T04:20:56Z

cmd/influx_tools/parquet/exporter.go

+func (e *exporter) gatherSchema(start, end time.Time, measurement string, rs *storage.ResultSet) {
+	fmt.Printf("gather schema start: %s, end: %s\n", start.Format(time.RFC3339), end.Format(time.RFC3339))
+
+	for rs.Next() {
+		measurementName := string(models.ParseName(rs.Name()))
+		if measurement != "" && measurement != measurementName {
+			continue
+		}
+
+		t := e.measurements.getTable(measurementName)
+		t.addTags(rs.Tags())
+		t.addField(rs.Field(), rs.FieldType())
+	}
+}


I would recommend the following, more efficient approach to gathering the schema in InfluxDB 1.x. As mentioned in this comment, I recommend you process and export each shard separately to one or more Parquet files, as there can be no schema conflicts within a single shard.

Given that, you will also be able to gather the complete schema very efficiently using existing indices.

For example, your getShards function returns a slice of *tsdb.Shard. Using that, you can get both the exact set of tag keys for that shard and the set of fields:

cond := influxql.MustParseExpr("_name = '<measurement name>'")
shard := shards[0]
tagKeys, err := e.tsdbStore.TagKeys(context.Background(), query.OpenAuthorizer, []uint64{shard.ID()}, cond)
fields := shard.MeasurementFields([]byte("<measurement name>"))

You could merge the full set of tag keys across all shards to ensure the Parquet schema tag keys are consistent, and also check that the field keys all have consistent data types. I would recommend generating a warning about field conflicts, but I would suggest you still export the individual shards, as the field types will be consistent within each shard.
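The merge-and-warn step described above can be sketched as follows. This is only an illustration under stated assumptions: `shardSchema` and `mergeShardSchemas` are hypothetical names, and field types are represented as plain strings rather than the real tsdb types. In the actual exporter the inputs would come from the `TagKeys` and `MeasurementFields` calls shown earlier.

```go
package main

import (
	"fmt"
	"sort"
)

// shardSchema is a hypothetical, simplified view of one shard's schema for a
// single measurement: its tag keys and a field name -> type name mapping.
type shardSchema struct {
	ShardID uint64
	TagKeys []string
	Fields  map[string]string
}

// mergeShardSchemas returns the sorted union of tag keys across all shards
// and one warning for each shard whose field type differs from the type
// first seen for that field. It never fails: conflicts are only reported,
// so each shard can still be exported on its own.
func mergeShardSchemas(shards []shardSchema) (tagKeys []string, warnings []string) {
	seenTags := make(map[string]struct{})
	firstType := make(map[string]string)  // field name -> first type seen
	firstShard := make(map[string]uint64) // field name -> shard where it was first seen

	for _, s := range shards {
		for _, k := range s.TagKeys {
			seenTags[k] = struct{}{}
		}

		// Visit fields in sorted order so the warnings are deterministic.
		names := make([]string, 0, len(s.Fields))
		for name := range s.Fields {
			names = append(names, name)
		}
		sort.Strings(names)

		for _, name := range names {
			typ := s.Fields[name]
			prev, seen := firstType[name]
			if !seen {
				firstType[name] = typ
				firstShard[name] = s.ShardID
				continue
			}
			if prev != typ {
				warnings = append(warnings, fmt.Sprintf(
					"field %q: type %s in shard %d conflicts with %s in shard %d",
					name, typ, s.ShardID, prev, firstShard[name]))
			}
		}
	}

	tagKeys = make([]string, 0, len(seenTags))
	for k := range seenTags {
		tagKeys = append(tagKeys, k)
	}
	sort.Strings(tagKeys)
	return tagKeys, warnings
}
```

Returning warnings instead of an error keeps the per-shard export going, which matches the suggestion that conflicts across shards should be reported but not block the export.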

alespour · 2024-08-27

Thank you very much for the feedback. I have refactored the code per your suggestion: individual shards are now exported, and the schema is retrieved using the code above. The export is therefore now a single-pass operation.
Unfortunately, I still get the same wrong result (incomplete output).

srebhan · 2024-09-04

@alespour Looking at the code, it seems you do not create a schema if a measurement has no tags. This might be rare, but a series without any tags is perfectly valid. Are you perhaps missing data in the output from measurements without tags?

alespour · 2024-09-05

@srebhan I believe the database and measurements I'm using for testing (telegraf; cpu, disk, etc.) do not contain tag-less data.
But you are right, that needs to be fixed. Thank you.
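A minimal sketch of the tag-less case raised above, assuming a hypothetical `buildColumns` helper and `column` type (neither is taken from the PR): an empty tag key set is treated as valid input and still yields a schema with the time and field columns.

```go
package main

import "sort"

// column is a hypothetical description of one output column.
type column struct {
	Name string
	Kind string // "time", "tag" or "field"
}

// buildColumns builds the column list for one measurement. An empty (or nil)
// tagKeys slice is valid and yields a schema with only the time column and
// the field columns, instead of no schema at all.
func buildColumns(tagKeys, fieldNames []string) []column {
	cols := []column{{Name: "time", Kind: "time"}}

	// Copy before sorting so the caller's slices are left untouched.
	tags := append([]string(nil), tagKeys...)
	sort.Strings(tags)
	for _, k := range tags {
		cols = append(cols, column{Name: k, Kind: "tag"})
	}

	fields := append([]string(nil), fieldNames...)
	sort.Strings(fields)
	for _, f := range fields {
		cols = append(cols, column{Name: f, Kind: "field"})
	}
	return cols
}
```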

I'm using Telegraf to check the Parquet output like this:
telegraf --config ./telegraf-parquet.conf --once

[[inputs.file]]
  files = ["/tmp/parquet/table-*.parquet"]
  name_override = "cpu"
  data_format = "parquet"
  tag_columns = ["datacenter","hostname","os","rack","region","service","team"]
  timestamp_column = "time"
  timestamp_format = "unix_ns"

[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout"]

srebhan · 2024-10-02T19:55:36Z

Closing this in favor of #25297 which was successfully tested.

alespour added 4 commits August 15, 2024 11:12

wip: new export-parquet command

29a61b1

wip: export to parquet

ebe96ef

wip: create v2 series key in actual exporter

ccc3ac6

chore: update dependencies

e91ff94

bednar mentioned this pull request Aug 21, 2024

feat: influx inspect export parquet #25047

Open

1 task

stuartcarnie reviewed Aug 27, 2024

fix: extract schema and export per shard

3697f75

alespour requested a review from stuartcarnie August 30, 2024 10:23

alespour added 2 commits September 5, 2024 14:07

fix: return error (temporarily) when tagkeys are empty

27fc88f

fix: do not initialize maps for schema with initial size

20011cf

srebhan mentioned this pull request Sep 9, 2024

feat(influx_tools): Add export to parquet files #25297

Open

2 tasks

srebhan closed this Oct 2, 2024

alespour commented Aug 20, 2024

stuartcarnie left a comment

stuartcarnie Aug 27, 2024

alespour Aug 27, 2024

srebhan Sep 4, 2024

alespour Sep 5, 2024 (edited)

alespour Sep 5, 2024

srebhan commented Oct 2, 2024
