Resume partial downloads. #317

Open
Wedge009 opened this issue Dec 17, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@Wedge009

I see particularly large result uploads (e.g. ~150 MiB) running into Failed response: EOF fairly frequently. Because the upload has to restart from 0% after every failed response, it sometimes takes several hours, even more than a day, to complete a single result upload, which risks missing the deadline.

I don't see any option to enable debug-level output, so I can't tell whether the issue is at my end or the server's.

I'm not sure if this is feasible or even possible with HTTP POST, but could we have the ability to ask the server what it did manage to receive and continue from there? (Or would that run into even more difficulty with respect to transmission integrity, etc.?)
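The idea in the question above (ask the server how much it already holds, then send only the rest) can be sketched as a small in-memory simulation. Everything here is hypothetical: `PartialUploadStore`, `upload_with_resume`, the upload ID and the chunk size are illustrative names, not part of the Folding@home client or Work Server API. The integrity concern is handled in this sketch by rejecting chunks that do not start at the stored offset and by checking a SHA-256 digest of the reassembled payload at the end.

```python
import hashlib


class PartialUploadStore:
    """Hypothetical server side: keeps the bytes received so far per upload ID."""

    def __init__(self):
        self._parts = {}

    def offset(self, upload_id):
        # Number of bytes of this upload the server already holds.
        return len(self._parts.get(upload_id, b""))

    def append(self, upload_id, start, chunk):
        # Accept a chunk only if it begins exactly where the stored data ends.
        if start != self.offset(upload_id):
            raise ValueError("chunk does not start at the stored offset")
        self._parts[upload_id] = self._parts.get(upload_id, b"") + chunk

    def finish(self, upload_id, expected_sha256):
        # Verify the reassembled payload before accepting it.
        data = self._parts.get(upload_id, b"")
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            raise ValueError("checksum mismatch")
        return data


def upload_with_resume(store, upload_id, payload, chunk_size, send):
    """Ask for the stored offset, then send only the remaining chunks."""
    pos = store.offset(upload_id)
    while pos < len(payload):
        chunk = payload[pos:pos + chunk_size]
        send(store, upload_id, pos, chunk)  # may raise ConnectionError
        pos += len(chunk)
    return store.finish(upload_id, hashlib.sha256(payload).hexdigest())


def demo():
    payload = bytes(range(256)) * 40  # 10240 bytes, sent as ten 1024-byte chunks
    store = PartialUploadStore()
    calls = {"n": 0, "sent": 0}

    def flaky_send(store, upload_id, start, chunk):
        calls["n"] += 1
        if calls["n"] == 4:  # simulate one dropped connection
            raise ConnectionError("EOF")
        store.append(upload_id, start, chunk)
        calls["sent"] += len(chunk)

    attempts = 0
    while True:
        attempts += 1
        try:
            result = upload_with_resume(store, "wu", payload, 1024, flaky_send)
            break
        except ConnectionError:
            continue  # retry; the next attempt starts at the stored offset
    return attempts, calls["sent"], result == payload


print(demo())  # (2, 10240, True)
```

In this simulation the second attempt starts at byte 3072 instead of 0, so the total sent equals the payload size even though one attempt failed. A real implementation would also need the client to return to a server that kept the partial data.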


Perhaps a separate issue, but I have problems uploading on a lone Windows 7 host, with three results stuck at 0%. Previously I had trouble with the v8 client even downloading tasks in the first place, so I reverted to the v7 client. But since the v8.4 beta release I have managed to get some work downloaded and have now run into stuck uploads. Not sure if it could be something certificate-related in Windows 7.

@Wedge009
Author

A sample of what I'm currently (2024-12-17 02:37 UTC) experiencing (same work-unit):

20:27:41:I1:WU797:Completed 625000 out of 625000 steps (100%)
20:27:41:I1:WU797:Saving result file ../logfile_01.txt
20:27:41:I1:WU797:Saving result file frame770.gro
20:27:41:I1:WU797:Saving result file frame770.xtc
20:27:42:I1:WU797:Saving result file md.log
20:27:42:I1:WU797:Saving result file science.log
20:27:42:I1:WU797:Saving result file state.cpt
20:27:42:I1:WU797:Folding@home Core Shutdown: FINISHED_UNIT
20:27:42:I1:WU797:Core returned FINISHED_UNIT (100)
20:27:44:I1:WU797:Uploading WU results
20:28:40:I1:WU797:UPLOAD 1% 1.52MiB of 158.92MiB
20:30:13:I1:WU797:UPLOAD 2% 3.21MiB of 158.92MiB
21:26:44:I1:WU797:UPLOAD 40% 63.53MiB of 158.92MiB
21:27:46:E :OUT157:Failed response: EOF
21:27:46:I1:WU797:Uploading WU results
21:27:47:I1:OUT160:> POST https://highland3.seas.upenn.edu/api/results HTTP/1.1
21:29:05:I1:WU797:UPLOAD 1% 1.54MiB of 158.92MiB
21:30:31:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
22:27:17:I1:WU797:UPLOAD 41% 65.12MiB of 158.92MiB
22:27:47:E :OUT160:Failed response: EOF
22:27:47:I1:WU797:Uploading WU results
22:27:49:I1:OUT161:> POST https://vav24.fah.temple.edu/api/results HTTP/1.1
22:27:49:E :OUT161:Failed response: EOF
22:27:49:I1:WU797:Uploading WU results
22:27:51:I1:OUT162:> POST https://ds03.scs.illinois.edu/api/results HTTP/1.1
22:29:03:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
22:30:27:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
23:27:41:I1:WU797:UPLOAD 46% 73.04MiB of 158.92MiB
23:27:51:E :OUT162:Failed response: EOF
23:27:51:I1:WU797:Uploading WU results
23:27:53:I1:OUT166:> POST https://highland4.seas.upenn.edu/api/results HTTP/1.1
23:27:54:E :OUT166:Failed response: EOF
23:27:54:I1:WU797:Uploading WU results
23:27:55:I1:OUT167:> POST https://ds01.scs.illinois.edu/api/results HTTP/1.1
23:29:03:I1:WU797:UPLOAD 1% 1.54MiB of 158.92MiB
23:30:36:I1:WU797:UPLOAD 2% 3.10MiB of 158.92MiB
23:34:19:I1:WU797:UPLOAD 5% 7.92MiB of 158.92MiB
23:34:40:E :OUT167:Failed response: EOF
23:34:40:I1:WU797:Uploading WU results
23:34:41:I1:OUT168:> POST https://fahserver1.flatironinstitute.org/api/results HTTP/1.1
23:35:53:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
23:37:08:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
00:34:36:I1:WU797:UPLOAD 45% 71.48MiB of 158.92MiB
00:34:41:E :OUT168:Failed response: EOF
00:34:41:I1:WU797:Uploading WU results
00:34:42:I1:OUT169:> POST https://highland3.seas.upenn.edu/api/results HTTP/1.1
00:35:36:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
00:36:41:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
01:33:49:I1:WU797:UPLOAD 49% 77.87MiB of 158.92MiB
01:34:46:E :OUT169:Failed response: EOF
01:34:46:I1:WU797:Uploading WU results
01:34:47:I1:OUT170:> POST https://vav24.fah.temple.edu/api/results HTTP/1.1
01:34:48:E :OUT170:Failed response: EOF
01:34:48:I1:WU797:Uploading WU results
01:34:49:I1:OUT171:> POST https://ds03.scs.illinois.edu/api/results HTTP/1.1
01:35:45:I1:WU797:UPLOAD 1% 1.52MiB of 158.92MiB
01:36:37:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
02:34:42:I1:WU797:UPLOAD 74% 117.53MiB of 158.92MiB
02:34:50:E :OUT171:Failed response: EOF
02:34:50:I1:WU797:Uploading WU results
02:34:51:I1:OUT172:> POST https://highland4.seas.upenn.edu/api/results HTTP/1.1
02:34:51:E :OUT172:Failed response: EOF
02:34:51:I1:WU797:Uploading WU results
02:34:53:I1:OUT173:> POST https://ds01.scs.illinois.edu/api/results HTTP/1.1
02:36:01:I1:WU797:UPLOAD 1% 1.55MiB of 158.92MiB
02:37:26:I1:WU797:UPLOAD 2% 3.11MiB of 158.92MiB
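A log like the one above can be summarized per attempt to see how long each upload ran before failing and how far it got. The following is a minimal sketch that assumes the line layout shown in this sample (time, level, source, message) and a log covering a single work unit; `summarize_attempts` is an illustrative helper, not a client feature. The log carries no dates, so a negative duration is treated as one midnight wrap.

```python
import re
from datetime import datetime, timedelta
from urllib.parse import urlsplit

# time : level : source : message, e.g. "21:27:46:E :OUT157:Failed response: EOF"
LINE = re.compile(r"^(\d\d:\d\d:\d\d):(\w+)\s*:(\w+):(.*)$")
PROGRESS = re.compile(r"^UPLOAD \d+% ([\d.]+)MiB of [\d.]+MiB$")


def summarize_attempts(lines):
    """Return (host, elapsed, last_reported_mib) for each failed attempt."""
    attempts = []
    current = None
    for raw in lines:
        match = LINE.match(raw.strip())
        if not match:
            continue
        stamp, _level, _source, message = match.groups()
        now = datetime.strptime(stamp, "%H:%M:%S")
        if message == "Uploading WU results":
            # A new attempt begins; the host is unknown until a POST line appears.
            current = {"start": now, "host": None, "mib": 0.0}
        elif current is None:
            continue
        elif message.startswith("> POST "):
            current["host"] = urlsplit(message.split()[2]).hostname
        elif PROGRESS.match(message):
            current["mib"] = float(PROGRESS.match(message).group(1))
        elif message.startswith("Failed response"):
            elapsed = now - current["start"]
            if elapsed < timedelta(0):  # no dates in the log: assume one midnight wrap
                elapsed += timedelta(days=1)
            attempts.append((current["host"], elapsed, current["mib"]))
            current = None
    return attempts


sample = [
    "21:27:46:I1:WU797:Uploading WU results",
    "21:27:47:I1:OUT160:> POST https://highland3.seas.upenn.edu/api/results HTTP/1.1",
    "22:27:17:I1:WU797:UPLOAD 41% 65.12MiB of 158.92MiB",
    "22:27:47:E :OUT160:Failed response: EOF",
]
for host, elapsed, mib in summarize_attempts(sample):
    print(host, elapsed, mib)
```

Applied to the full sample, this shows that several of the failed attempts end roughly one hour after they start, while others fail within seconds.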

@Wedge009
Author

Wedge009 commented Dec 18, 2024

I have problems uploading on a lone Windows 7 host with three results stuck at 0%.

I think this might be a non-issue related to #130. I was sure that I saw upload progress on Windows previously, but I must have been mistaken. Being stuck at 0% must have been a symptom of the issue here, where uploads were constantly restarting from 0%. And while three uploads were in progress simultaneously, no more work was being downloaded.

@kbernhagen
Contributor

Could this be the MTU size problem?
https://foldingforum.org/viewtopic.php?t=42248

@Wedge009
Author

Wedge009 commented Dec 18, 2024

All my hosts running fah are connected via ethernet and I verified MTU is the standard 1500.

I suspect the problem might be at my end: as the output above shows, the failures were on multiple servers, and I doubt they are all broken. Also, today is much milder than recent days (it was over 40 °C yesterday) and I noticed network transmission seems more stable. (That repeatedly failing upload eventually completed overnight, at 17:34:48 UTC, over 21 hours after task completion.)

So that brings back my original question on whether it's possible to implement some sort of resume functionality for potentially unreliable connections. Spending several hours on a task and then not being able to upload seems rather wasteful. (I've had a few tasks expire or be marked as 'failed'.)

@jcoffland changed the title from "Is it possible to implement result upload resuming?" to "Resume partial downloads." Dec 18, 2024
@jcoffland added the enhancement (New feature or request) label Dec 18, 2024
@jcoffland
Member

This is possible and something I've had in mind for some time now. It will require changes to all of the Work Servers (WS) so they can save and continue partial uploads. Then the client itself must also support this. It would likely result in significant network bandwidth savings but it's going to take some time and there are currently other higher priority items.
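To get a rough sense of the bandwidth at stake, the sketch below compares restart-from-zero with an idealized resume, using the last progress value logged before each failure in the sample posted earlier in this thread. It rests on several assumptions: the logged values are lower bounds on what was actually sent, the attempt after the last listed failure is assumed to succeed (the real upload took longer), and resume is assumed to lose no bytes, which would require the client to return to a server that kept the partial data. `restart_vs_resume` is an illustrative helper name.

```python
def restart_vs_resume(total_mib, failed_progress_mib):
    """Compare MiB sent with restart-from-zero against an idealized resume.

    failed_progress_mib lists how far each failed attempt got before the EOF.
    Returns (restart_total, resume_total, fraction_saved).
    """
    wasted = sum(failed_progress_mib)   # bytes thrown away by restarting
    restart_total = wasted + total_mib  # every failure, then one full upload
    resume_total = total_mib            # idealized: each byte is sent once
    return restart_total, resume_total, wasted / restart_total


# Last progress logged before each non-instant failure in the sample log.
FAILED_MIB = [63.53, 65.12, 73.04, 7.92, 71.48, 77.87, 117.53]

restart, resume, saved = restart_vs_resume(158.92, FAILED_MIB)
print(f"restart: {restart:.2f} MiB, resume: {resume:.2f} MiB, saved: {saved:.0%}")
```

Under these assumptions, the seven failed attempts in the sample account for about 476 MiB of retransmitted data for a 158.92 MiB result, roughly three quarters of the total traffic.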

@Wedge009
Author

No worries, just good to know it's possible and part of the plan. Thanks.


3 participants