feat: Include Object Paths in S3 sync summary #17117

bbernays · 2024-03-11T20:05:31Z

This PR adds a column containing the full keys of all objects written during the sync.

objects_synced is a map[string][]string: each map key is a table name, and each value is the list of object keys written for that table.

{
  "_cq_source_name": "aws",
  "_cq_sync_time": "2024-03-11 17:11:16.369293",
  "cli_version": "development",
  "destination_name": "s3",
  "destination_path": "localhost:7778",
  "destination_version": null,
  "errors": 9,
  "objects_synced": {
    "aws_s3_buckets": [
      "aws_s3_buckets/8c862045-ed27-43f4-a70d-2b5f955e0748.parquet"
    ]
  },
  "resources": 29,
  "source_name": "aws",
  "source_path": "cloudquery/aws",
  "source_version": "v25.2.0",
  "sync_id": "a61ca899-1210-497b-ba1d-f442c26792a3",
  "warnings": 1
}
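As a rough illustration of the objects_synced shape described above, the following sketch encodes a map[string][]string to JSON. The helper name encodeObjectsSynced is hypothetical and is not taken from the PR; how the plugin actually serializes the column is not shown in this conversation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeObjectsSynced is a hypothetical helper (not from the PR). It encodes
// a table-name -> object-keys map as a JSON object, which is the shape the
// objects_synced column has in the example record.
func encodeObjectsSynced(objects map[string][]string) (string, error) {
	b, err := json.Marshal(objects)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	objects := map[string][]string{
		"aws_s3_buckets": {
			"aws_s3_buckets/8c862045-ed27-43f4-a70d-2b5f955e0748.parquet",
		},
	}
	out, err := encodeObjectsSynced(objects)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```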

This is an extension of #17112

erezrokah · 2024-03-11T20:48:29Z

plugins/destination/s3/client/spec/spec.go

@@ -181,10 +181,21 @@ func (s *Spec) Validate() error {

 func (s *Spec) ReplacePathVariables(table string, fileIdentifier string, t time.Time) string {
 name := strings.ReplaceAll(s.Path, varTable, table)
+
+ if table == "cloudquery_sync_summary" {


Shouldn't this be under the option to enable the summary?

erezrokah · 2024-03-11T20:49:14Z

plugins/destination/s3/client/write.go

 objKey := c.spec.ReplacePathVariables(table.Name, uuid.NewString(), time.Now().UTC())
+ c.objectKeys[table.Name] = append(c.objectKeys[table.Name], objKey)
+
+ if table.Name == "cloudquery_sync_summary" {


This seems too tightly coupled with the CLI that uses the same name for the table

It is designed to be an extension of the CLI summary table. All this functionality does is add a column to that table.

yevgenypats

A few questions:

1. Do we already have a sync summary message, or is it new?
2. Can objectKeys run out of memory, given that it is not bounded by anything?

Depending on 1 and 2, we might want to implement the sync summary only in the CLI. In that case we would only add a file that says the sync finished, with a path containing {{sync_id}} in the config, instead of sending the summary to every destination (unless we already do that?)

bbernays · 2024-03-12T01:03:33Z

A few questions:

Do we already have a sync summary message, or is it new?

It will be added in #17112

Can objectKeys run out of memory, given that it is not bounded by anything?

I have added a bound so that only the first 10,000 keys are recorded.
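A minimal sketch of what such a bound could look like, assuming the limit applies to the total number of keys across all tables (the comment above does not say whether it is per table or global). The keyTracker type, its methods, and the maxTrackedKeys constant are hypothetical names, not the PR's code, and the sketch is not safe for concurrent use without extra locking.

```go
package main

import "fmt"

// maxTrackedKeys mirrors the 10,000-key limit mentioned in the comment.
// The constant name is hypothetical.
const maxTrackedKeys = 10000

// keyTracker is a hypothetical stand-in for the client's objectKeys map,
// with a cap on how many keys it will hold in total.
type keyTracker struct {
	limit int
	count int
	keys  map[string][]string
}

func newKeyTracker(limit int) *keyTracker {
	return &keyTracker{limit: limit, keys: make(map[string][]string)}
}

// record stores objKey under table unless the limit has been reached.
// It reports whether the key was stored.
func (k *keyTracker) record(table, objKey string) bool {
	if k.count >= k.limit {
		return false
	}
	k.keys[table] = append(k.keys[table], objKey)
	k.count++
	return true
}

func main() {
	// A small limit is used here so the cut-off is easy to see;
	// maxTrackedKeys would be passed in real use.
	tracker := newKeyTracker(2)
	for i := 0; i < 3; i++ {
		key := fmt.Sprintf("aws_s3_buckets/object-%d.parquet", i)
		fmt.Println(key, tracker.record("aws_s3_buckets", key))
	}
}
```

With a cap like this, memory use stays bounded, at the cost of an incomplete key list for syncs that write more objects than the limit.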

Depending on 1 and 2, we might want to implement the sync summary only in the CLI. In that case we would only add a file that says the sync finished, with a path containing {{sync_id}} in the config, instead of sending the summary to every destination (unless we already do that?)

Adding {{sync_id}} only partially fixes the problem. S3 paths are used for at least two distinct things:

Searching for objects: you can search by path prefix to find all objects that share that prefix.
Partitioning time-based data in S3/Athena: putting the UUID into the path makes this much harder, because the UUID must be included in the partition definition.

With the above solution, assuming fewer than 10,000 objects in a single sync, we get the best of both worlds.
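To make the prefix argument concrete, here is a small sketch that filters a list of object keys by prefix in memory. The key layout (table/year/month/day/file) and the function name filterByPrefix are assumptions made for illustration only. Against a real bucket, the same selection would be done server-side by listing objects with a prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// filterByPrefix returns the keys that start with prefix.
// Hypothetical helper, used only to illustrate prefix-based lookup.
func filterByPrefix(keys []string, prefix string) []string {
	var matched []string
	for _, k := range keys {
		if strings.HasPrefix(k, prefix) {
			matched = append(matched, k)
		}
	}
	return matched
}

func main() {
	// Assumed layout: <table>/<year>/<month>/<day>/<file>.
	keys := []string{
		"aws_s3_buckets/2024/03/11/a.parquet",
		"aws_s3_buckets/2024/03/12/b.parquet",
		"aws_ec2_instances/2024/03/11/c.parquet",
	}
	// All objects for one table on one day share a predictable prefix.
	fmt.Println(filterByPrefix(keys, "aws_s3_buckets/2024/03/11/"))
}
```

If a per-sync UUID were part of the path, the prefix for a given table and day would no longer be predictable without knowing that UUID first.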

bbernays added 4 commits March 11, 2024 14:59

Update write.go

b2fb7ff

Create summary.go

7447042

Update spec.go

0774475

Update client.go

0a0fa21

bbernays requested a review from a team as a code owner March 11, 2024 20:05

bbernays requested review from maaarcelino and disq and removed request for a team March 11, 2024 20:05

cq-bot added the area/plugin/destination/s3 label Mar 11, 2024

erezrokah approved these changes Mar 11, 2024

erezrokah reviewed Mar 11, 2024

bbernays mentioned this pull request Mar 11, 2024

Move CQ Summary Data to Custom Message Type #17118

Open

yevgenypats requested changes Mar 12, 2024

bbernays added 2 commits March 11, 2024 19:49

Update client.go

d1e061d

Update write.go

594ecfd

Update write.go

691f757
