Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Include Object Paths in S3 sync summary #17117

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Conversation

bbernays
Copy link
Collaborator

@bbernays bbernays commented Mar 11, 2024

This PR adds in a column that will include the full keys of all objects synced in the sync

objects_synced contains a map[string][]string where the map key is the table name and the array of strings are the object keys

{
  "_cq_source_name": "aws",
  "_cq_sync_time": "2024-03-11 17:11:16.369293",
  "cli_version": "development",
  "destination_name": "s3",
  "destination_path": "localhost:7778",
  "destination_version": null,
  "errors": 9,
  "objects_synced": {
    "aws_s3_buckets": [
      "aws_s3_buckets/8c862045-ed27-43f4-a70d-2b5f955e0748.parquet"
    ]
  },
  "resources": 29,
  "source_name": "aws",
  "source_path": "cloudquery/aws",
  "source_version": "v25.2.0",
  "sync_id": "a61ca899-1210-497b-ba1d-f442c26792a3",
  "warnings": 1
}

This is an extension of #17112

@bbernays bbernays requested a review from a team as a code owner March 11, 2024 20:05
@bbernays bbernays requested review from maaarcelino and disq and removed request for a team March 11, 2024 20:05
@@ -181,10 +181,21 @@ func (s *Spec) Validate() error {

func (s *Spec) ReplacePathVariables(table string, fileIdentifier string, t time.Time) string {
name := strings.ReplaceAll(s.Path, varTable, table)

if table == "cloudquery_sync_summary" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be under the option to enable the summary?

objKey := c.spec.ReplacePathVariables(table.Name, uuid.NewString(), time.Now().UTC())
c.objectKeys[table.Name] = append(c.objectKeys[table.Name], objKey)

if table.Name == "cloudquery_sync_summary" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too tightly coupled with the CLI that uses the same name for the table

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is designed to be an extension of the CLI summary table, all this functionality is doing is adding a column to that table.

Copy link
Member

@yevgenypats yevgenypats left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few questions:

  1. do we already have sync summary message? is it new
  2. will objectkeys will run out of memory and it is not bound by anything?

depends on 1 and 2 we might want to implement sync summary only on the CLI in that case and then only add file that sais this sync finished and have a path of {{sync_id}} in the config instead of sending summary to every destination (unless we already do that?)

@bbernays
Copy link
Collaborator Author

bbernays commented Mar 12, 2024

Few questions:

  1. do we already have sync summary message? is it new

It will be added in #17112

  1. will objectkeys will run out of memory and it is not bound by anything?

I have added bounding that will limit it to only log the first 10,000 keys

depends on 1 and 2 we might want to implement sync summary only on the CLI in that case and then only add file that sais this sync finished and have a path of {{sync_id}} in the config instead of sending summary to every destination (unless we already do that?)

Adding {{sync_id}} only partially fixes the problem... S3 paths are used for at least 2 distinct things:

  1. A means to search for objects: you can search for a path prefix to find all objects with said prefix
  2. A means to partition data in S3/Athena for time based data. Adding the UUID into that means that it would be much harder to do because the UUID must be included in the definition

With the above solution assuming less than 10,000 objects in a single sync we get best of both worlds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

None yet

4 participants