Improve handling of large file transfers #966
write_chunk_init -- start a timer...
perhaps one message per minute?
That seems like a reasonable interval to me.
It would be awesome if it could say downloaded x/y bytes (z%) :) We don't always know the size, so sometimes it would just have to say downloaded x bytes. If that's hard to implement then don't bother, but it would be really nice for debugging and keeping an eye on slow transfers.
write_chunk is called by get(), and get() accepts length as an argument. When the length is unknown, its value is zero, so if the value is > 0 we could use that to give a progress message. It would not be a big deal.
get() also has message as an argument, which might have a 'size' value as well.
pass some size estimate to chunk_init()
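A minimal sketch of what that could look like, assuming write_chunk() is called once per buffer and that a size hint (from get()'s length argument, or from the message's 'size' field) gets passed down to chunk_init(). The ProgressTracker class and its names are illustrative only, not the actual sr3 API:

```python
import logging
import time

logger = logging.getLogger(__name__)

class ProgressTracker:
    """Illustrative only: accumulate bytes and log progress at most once per interval."""

    def __init__(self, expected_size=0, interval=60):
        self.expected_size = expected_size   # 0 means the total size is unknown
        self.interval = interval             # seconds between progress messages
        self.bytes_written = 0
        self.last_report = time.monotonic()

    def update(self, nbytes):
        self.bytes_written += nbytes
        now = time.monotonic()
        if now - self.last_report < self.interval:
            return
        self.last_report = now
        if self.expected_size > 0:
            pct = 100.0 * self.bytes_written / self.expected_size
            logger.info("downloaded %d/%d bytes (%.1f%%)",
                        self.bytes_written, self.expected_size, pct)
        else:
            logger.info("downloaded %d bytes", self.bytes_written)
```

chunk_init() would create the tracker with whatever size estimate is available, and write_chunk() would call update(len(chunk)) on every buffer.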
One other thing I just remembered: when sr3 is in the middle of a big/slow download, it can't be stopped cleanly. Either the download has to finish or it will get killed by sr3 stop. If the transfer code is being worked on, maybe we could also look into a way to interrupt and stop a transfer part-way through, then do some cleanup, like putting the message into the retry queue? Let me know if I should create a separate issue.
I feel like that should be its own issue, but when I start working on this, I could include a fix for that as well. Could introduce a ...
Could also just rename the issue so it covers both problems under a common one.
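Just to sketch the interrupt-and-requeue idea mentioned above (all names here -- should_stop, remove_partial, retry_queue -- are hypothetical, not existing sr3 API): the download loop could check a stop flag between chunks, abort cleanly, and let the caller requeue the message.

```python
class InterruptedTransfer(Exception):
    """Raised when a stop was requested part-way through a transfer."""
    pass

def download_loop(read_chunk, write_chunk, should_stop):
    """Copy chunks until the source is exhausted, bailing out if asked to stop."""
    while True:
        if should_stop():
            raise InterruptedTransfer()
        chunk = read_chunk()
        if not chunk:
            return
        write_chunk(chunk)

# the caller could then clean up and requeue:
# try:
#     download_loop(src.read_chunk, dst.write_chunk, lambda: stop_requested)
# except InterruptedTransfer:
#     remove_partial(dst)        # delete the partial file
#     retry_queue.put(message)   # put the message back for a later attempt
```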
I've been looking at how FTP's get method works, and it looks like it does a full download of a file in one shot (at least when it's not binary). There isn't really any logic about writing out the file in chunks and looping, from what I've been seeing. That would mean that using a timer would not work for preventing sanity from killing instances on big downloads.
@petersilva proposed having some kind of check inside sanity that verifies whether the transfer is complete or not.
So the stuff in the transfer class would work for built-in transfers... so it still needs to be done. For FTP, the thing I was hinting at was to use existing metrics, or generate a few additional ones, then have sr3 sanity base its decision to kill on the metrics. So... the metrics are generated once per batch. Say the batch is 25: it will write an accumulation of statistics to the metrics file every 25 files.
The sr.py program already loads current metrics for each component (c), configuration (cfg) and instance (i) as part of startup. An example of what might work... the msgRate indicates how many files per second are being transferred, so you might be able to use that to say that sanity should wait, say, 10 times longer than it takes to transfer a file on average. So if a sender has a transfer rate of 0.004 files/second, then: 10*(1/0.004) == 2500 seconds. msgRate would work on both subscribers and senders, but it might be thrown off by rejections... this is just the general idea... sr3 sanity could walk the instance metrics and get a better idea of what to kill from that.
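As a sketch of that calculation (assuming sanity has the per-instance metrics loaded as a dict with an 'msgRate' entry in files per second; the function name and default are made up):

```python
def kill_threshold(metrics, factor=10, default=300):
    """Seconds sanity should wait before killing an instance, based on msgRate."""
    rate = metrics.get('msgRate', 0)
    if rate <= 0:
        return default               # no usable rate; fall back to a default
    return factor * (1 / rate)       # wait `factor` average per-file transfer times

# e.g. a sender transferring 0.004 files/second:
# kill_threshold({'msgRate': 0.004})  ->  2500.0 seconds
```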
Another similar idea: sr3 could log something like ... to a file, and sr3 sanity could check that file and see if it's growing. I'm not sure if this is more or less complicated than calculating an estimated download rate and estimating the expected download time based off that.
Writing a file per file gets painful if you're writing hundreds of files per second... The metrics reduce that to one file per batch. You could still get the same info... there are other metrics (transferBytes) one could use to get a more predictive measure, but I think maybe a test: if writing less than one file per second, then the one file per file thing makes sense. Again... you need a metric (but this time built into the flow) to decide whether to write the file or not. It could include how long it thinks it will take (combining the internal current rate with the size), or even just a plain estimate of a completion time... or a "kill me if I haven't done anything by this time"... removing all responsibility from sanity to calculate what is "too long".
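A hypothetical sketch of that last variant, where the flow itself records a "don't presume me dead before" deadline from its current byte rate and the size of the file it is working on (the function and its fields are made up for illustration):

```python
import time

def completion_deadline(bytes_so_far, elapsed, total_size, slack=10):
    """Estimate a wall-clock time after which the instance may be presumed stuck."""
    if elapsed <= 0 or bytes_so_far <= 0 or total_size <= 0:
        return time.time() + 3600            # nothing to go on; generous default
    rate = bytes_so_far / elapsed            # observed bytes per second
    remaining = (total_size - bytes_so_far) / rate
    return time.time() + slack * remaining

# the flow would write this deadline into its metrics; sanity would only kill
# the instance if time.time() is already past it, instead of guessing "too long".
```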
Weird thing I started noticing... in the dynamic load test, running on my laptop (new thing...), the fan turns on and the ordinarily trivial transfers via localhost FTP can crash:
This is odd.
The transfer is taking 8 seconds, and the timer is going off.
I read over the logic in FTP and in Transfer... and it's definitely not "in one shot" as described above. We don't use the get verb, we use the retr routines, which transfer in self.o.bufsize-sized chunks via the write_chunk() routine, and there is even an on_data() routine available for subclasses to do odd transformations on the data in flight. Anyways, I added a routine logProgress() that will print a message once in a while during long (FTP) transfers. If you're not using binary accelerators, then SFTP uses the same routines in the Transfer (parent) class, so logProgress() should work as-is for sftp and http also. I think it does everything we need. There was also an instance-crashing bug that occurred because the more focused try/excepts no longer catch errors where the upstream file does not exist, causing ftp subscribers to crash when dealing with that case. The actual crash is because an alarm timer is set but never cleared when an error occurs. The fix is to cancel the timer regardless of whether the operation succeeds or fails.
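The timer part of that fix, sketched with the standard signal module (the actual helpers in the transfer code may be named differently): the alarm has to be cancelled in a finally block so it is cleared whether the operation succeeds or raises.

```python
import signal

def _timeout(signum, frame):
    raise TimeoutError("transfer timed out")

def guarded_operation(operation, timeout=300):
    """Run operation() under an alarm, always cancelling the alarm afterwards."""
    signal.signal(signal.SIGALRM, _timeout)
    signal.alarm(timeout)          # arm the watchdog
    try:
        return operation()
    finally:
        signal.alarm(0)            # cancel even when operation() raises
```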
continued work on #966 having timers expire at odd times.
addressing more vulnerabilities with #966 ftp transfer class, not always cancelling timers.
When downloading large files that are kind of slow, sanity can kill the transfer in progress.
Could address this by having write_chunk in transfer/__init__.py emit occasional encouraging messages to convince sanity that the log is not dead.