Add in selective pushing/pulling from s3 #23

Open
larryfenn opened this issue Dec 19, 2019 · 3 comments

Comments

@larryfenn
Contributor

Right now, datakit data pushes and pulls the entire contents of the data directory, even for files that haven't changed. This causes pushes and pulls to take much more time than they should in some projects.

Instead, datakit data should only push and pull files that have changed between disk and S3. Is there an S3 flag that we can pass through (and then make the default)?

@zstumgoren
Contributor

@larryfenn The push/pull commands delegate to the AWS CLI sync command, which should only send new or updated files to the target destination.

When you run datakit data [push|pull], the exact AWS CLI command being executed should be printed to the shell, along with the list of files that were uploaded/downloaded.

Can you run a test in your shell that demonstrates a case where files that have not been changed are actually being synced and paste the shell session contents here?
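
One way to capture such a session is sketched below. The run_twice helper and the log file names are hypothetical, not part of datakit; the helper simply runs whatever command it is given twice in a row and saves each run's output so the two can be compared.

```shell
# Hypothetical helper (not part of datakit): run the same command twice in a
# row, saving each run's output to a log file in the current directory.
# If sync behaves as intended, the second log should list no transferred
# files, since nothing changed between the two runs.
run_twice() {
  "$@" > first_run.log 2>&1
  "$@" > second_run.log 2>&1
  # $(( ... )) strips the padding some wc implementations add.
  echo "first run: $(( $(wc -l < first_run.log) )) lines"
  echo "second run: $(( $(wc -l < second_run.log) )) lines"
}
```

Calling it as run_twice datakit data push and pasting both log files here would show whether the second run still lists files that did not change.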

@zstumgoren
Contributor

zstumgoren commented Dec 19, 2019

@larryfenn One additional thought -- I wonder if the time delay you mentioned is due not to sending unchanged files, but to the diff operation that determines what has actually changed between the source and destination (akin to the long wait you might experience when using rsync). If you can send a shell session where you experienced a long wait time, that could help pinpoint the nature of the issue/bug.
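
A rough way to separate the two costs is to time a dry run, which (as far as I understand the AWS sync command) still compares source and destination but transfers nothing. The elapsed_seconds helper below is a hypothetical sketch, not part of datakit, and it assumes that datakit data push dryrun maps onto the AWS dry-run behavior.

```shell
# Hypothetical helper (not part of datakit): run a command, let its output
# pass through, then print its exit status and wall-clock duration.
# Resolution is whole seconds because it relies on `date +%s`.
elapsed_seconds() {
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  echo "exit=$status seconds=$((end - start))"
}
```

For example, elapsed_seconds datakit data push dryrun would time the comparison step alone. If that takes about as long as a real push, the delay is probably in the diff rather than in the transfer.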

@zstumgoren
Contributor

@larryfenn Sorry, one final thought as a stop-gap workflow. The push|pull commands only support pass-through of boolean flags to the underlying AWS utility. For example: datakit data push delete (which would delete any files on S3 that no longer exist locally).

But the AWS sync command also has --include and --exclude flags that use patterns to target/exclude files for the sync operation. As a temporary workaround, you might want to run datakit data push dryrun to obtain the AWS command on the shell. Then update that raw AWS command with an --include or --exclude flag that minimizes the delay. Bit of a kluge, but might speed things up for the time being until we get to the root of the underlying problem.
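
As a minimal sketch of that workaround: the source directory and bucket path below are hypothetical placeholders, to be replaced with the values from the command that datakit data push dryrun prints. The helper only prints the modified AWS command so it can be reviewed before being run by hand.

```shell
# Hypothetical placeholders: substitute the source and destination from the
# command that `datakit data push dryrun` prints.
SRC="data/"
DEST="s3://example-bucket/example-project/"

# Print (do not run) an `aws s3 sync` command restricted to one pattern.
# --exclude "*" drops every file, and the later --include adds matching files
# back, because later filters take precedence over earlier ones.
# --dryrun is kept so the first manual run only lists what would be sent.
filtered_sync_cmd() {
  pattern=$1
  printf 'aws s3 sync %s %s --exclude "*" --include "%s" --dryrun\n' \
    "$SRC" "$DEST" "$pattern"
}

filtered_sync_cmd "*.csv"
```

Once the dry-run listing looks right, dropping --dryrun from the printed command performs the actual sync for just the matching files.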
