Resume partial downloads. #317

Wedge009 · 2024-12-17T01:03:06Z

I see particularly large result uploads (e.g. ~150 MiB) running into 'Failed response: EOF' fairly frequently. It sometimes takes several hours to complete a result upload, occasionally more than a day, which risks missing the deadline, because with every 'failed response' the upload has to restart from 0%.

I don't see any option to enable debug-level output, so I can't tell if the issue is at my end or the server's.

I'm not sure if this is feasible or even possible with HTTP POST, but could we have the ability to ask the server what it did manage to receive and continue from there? (Or would that run into even more difficulty with respect to transmission integrity, etc.?)

Perhaps a separate issue, but I have problems uploading on a lone Windows 7 host with three results stuck at 0%. Previously I had trouble with the v8 client even downloading tasks in the first place, so I reverted to the v7 client. But since the v8.4 beta release I have managed to get some work downloaded and have now run into stuck uploads. Not sure if it could be something certificate-related in Windows 7.


Wedge009 · 2024-12-17T02:48:52Z

A sample of what I'm currently (2024-12-17 02:37 UTC) experiencing (same work-unit):

20:27:41:I1:WU797:Completed 625000 out of 625000 steps (100%)
20:27:41:I1:WU797:Saving result file ../logfile_01.txt
20:27:41:I1:WU797:Saving result file frame770.gro
20:27:41:I1:WU797:Saving result file frame770.xtc
20:27:42:I1:WU797:Saving result file md.log
20:27:42:I1:WU797:Saving result file science.log
20:27:42:I1:WU797:Saving result file state.cpt
20:27:42:I1:WU797:Folding@home Core Shutdown: FINISHED_UNIT
20:27:42:I1:WU797:Core returned FINISHED_UNIT (100)
20:27:44:I1:WU797:Uploading WU results
20:28:40:I1:WU797:UPLOAD 1% 1.52MiB of 158.92MiB
20:30:13:I1:WU797:UPLOAD 2% 3.21MiB of 158.92MiB
21:26:44:I1:WU797:UPLOAD 40% 63.53MiB of 158.92MiB
21:27:46:E :OUT157:Failed response: EOF
21:27:46:I1:WU797:Uploading WU results
21:27:47:I1:OUT160:> POST https://highland3.seas.upenn.edu/api/results HTTP/1.1
21:29:05:I1:WU797:UPLOAD 1% 1.54MiB of 158.92MiB
21:30:31:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
22:27:17:I1:WU797:UPLOAD 41% 65.12MiB of 158.92MiB
22:27:47:E :OUT160:Failed response: EOF
22:27:47:I1:WU797:Uploading WU results
22:27:49:I1:OUT161:> POST https://vav24.fah.temple.edu/api/results HTTP/1.1
22:27:49:E :OUT161:Failed response: EOF
22:27:49:I1:WU797:Uploading WU results
22:27:51:I1:OUT162:> POST https://ds03.scs.illinois.edu/api/results HTTP/1.1
22:29:03:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
22:30:27:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
23:27:41:I1:WU797:UPLOAD 46% 73.04MiB of 158.92MiB
23:27:51:E :OUT162:Failed response: EOF
23:27:51:I1:WU797:Uploading WU results
23:27:53:I1:OUT166:> POST https://highland4.seas.upenn.edu/api/results HTTP/1.1
23:27:54:E :OUT166:Failed response: EOF
23:27:54:I1:WU797:Uploading WU results
23:27:55:I1:OUT167:> POST https://ds01.scs.illinois.edu/api/results HTTP/1.1
23:29:03:I1:WU797:UPLOAD 1% 1.54MiB of 158.92MiB
23:30:36:I1:WU797:UPLOAD 2% 3.10MiB of 158.92MiB
23:34:19:I1:WU797:UPLOAD 5% 7.92MiB of 158.92MiB
23:34:40:E :OUT167:Failed response: EOF
23:34:40:I1:WU797:Uploading WU results
23:34:41:I1:OUT168:> POST https://fahserver1.flatironinstitute.org/api/results HTTP/1.1
23:35:53:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
23:37:08:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
00:34:36:I1:WU797:UPLOAD 45% 71.48MiB of 158.92MiB
00:34:41:E :OUT168:Failed response: EOF
00:34:41:I1:WU797:Uploading WU results
00:34:42:I1:OUT169:> POST https://highland3.seas.upenn.edu/api/results HTTP/1.1
00:35:36:I1:WU797:UPLOAD 1% 1.51MiB of 158.92MiB
00:36:41:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
01:33:49:I1:WU797:UPLOAD 49% 77.87MiB of 158.92MiB
01:34:46:E :OUT169:Failed response: EOF
01:34:46:I1:WU797:Uploading WU results
01:34:47:I1:OUT170:> POST https://vav24.fah.temple.edu/api/results HTTP/1.1
01:34:48:E :OUT170:Failed response: EOF
01:34:48:I1:WU797:Uploading WU results
01:34:49:I1:OUT171:> POST https://ds03.scs.illinois.edu/api/results HTTP/1.1
01:35:45:I1:WU797:UPLOAD 1% 1.52MiB of 158.92MiB
01:36:37:I1:WU797:UPLOAD 2% 3.15MiB of 158.92MiB
02:34:42:I1:WU797:UPLOAD 74% 117.53MiB of 158.92MiB
02:34:50:E :OUT171:Failed response: EOF
02:34:50:I1:WU797:Uploading WU results
02:34:51:I1:OUT172:> POST https://highland4.seas.upenn.edu/api/results HTTP/1.1
02:34:51:E :OUT172:Failed response: EOF
02:34:51:I1:WU797:Uploading WU results
02:34:53:I1:OUT173:> POST https://ds01.scs.illinois.edu/api/results HTTP/1.1
02:36:01:I1:WU797:UPLOAD 1% 1.55MiB of 158.92MiB
02:37:26:I1:WU797:UPLOAD 2% 3.11MiB of 158.92MiB
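
To put a rough number on the restarts, here is a small sketch that sums the last progress reported before each 'Failed response' in the log above. The LOG string is an abbreviated copy of that log (only the final progress line of each attempt plus the failure lines), and the function name is hypothetical. Progress is only logged periodically, so the real figure is somewhat higher.

```python
import re

# Abbreviated copy of the log above: the last progress line before each
# failure, plus the failure lines themselves.
LOG = """\
21:26:44:I1:WU797:UPLOAD 40% 63.53MiB of 158.92MiB
21:27:46:E :OUT157:Failed response: EOF
22:27:17:I1:WU797:UPLOAD 41% 65.12MiB of 158.92MiB
22:27:47:E :OUT160:Failed response: EOF
22:27:49:E :OUT161:Failed response: EOF
23:27:41:I1:WU797:UPLOAD 46% 73.04MiB of 158.92MiB
23:27:51:E :OUT162:Failed response: EOF
23:27:54:E :OUT166:Failed response: EOF
23:34:19:I1:WU797:UPLOAD 5% 7.92MiB of 158.92MiB
23:34:40:E :OUT167:Failed response: EOF
00:34:36:I1:WU797:UPLOAD 45% 71.48MiB of 158.92MiB
00:34:41:E :OUT168:Failed response: EOF
01:33:49:I1:WU797:UPLOAD 49% 77.87MiB of 158.92MiB
01:34:46:E :OUT169:Failed response: EOF
01:34:48:E :OUT170:Failed response: EOF
02:34:42:I1:WU797:UPLOAD 74% 117.53MiB of 158.92MiB
02:34:50:E :OUT171:Failed response: EOF
02:34:51:E :OUT172:Failed response: EOF
"""

UPLOAD_RE = re.compile(r"UPLOAD \d+% ([\d.]+)MiB of ([\d.]+)MiB")


def discarded_mib(log_text):
    """Sum the last reported progress (MiB) before each 'Failed response'."""
    total = 0.0
    last_progress = 0.0
    for line in log_text.splitlines():
        match = UPLOAD_RE.search(line)
        if match:
            last_progress = float(match.group(1))
        elif "Failed response" in line:
            total += last_progress
            last_progress = 0.0  # the next attempt starts again from 0%
    return total


print(f"{discarded_mib(LOG):.2f} MiB sent and then discarded")
```

On this excerpt that comes to roughly 476 MiB thrown away, about three times the size of the 158.92 MiB result itself.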

Wedge009 · 2024-12-18T00:53:30Z

> I have problems uploading on a lone Windows 7 host with three results stuck at 0%.

I think this might be a non-issue related to #130. I was sure that I saw upload progress on Windows previously, but I must have been mistaken. Being stuck on 0% must have been a consequence of the issue here, where uploads were constantly restarting from 0%. And when three uploads were in progress simultaneously, no more work was being downloaded.

kbernhagen · 2024-12-18T00:59:47Z

Could this be the MTU size problem?
https://foldingforum.org/viewtopic.php?t=42248

Wedge009 · 2024-12-18T02:03:53Z

All my hosts running fah are connected via ethernet and I verified MTU is the standard 1500.
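
For anyone else wanting to rule out MTU on their own link, a minimal sketch of the arithmetic behind the usual don't-fragment ping test. The function name is made up for illustration, and it assumes plain IPv4/IPv6 headers with no options or extension headers.

```python
IPV4_HEADER = 20   # bytes, assuming no IP options
IPV6_HEADER = 40   # bytes, assuming no extension headers
ICMP_HEADER = 8    # bytes, echo request header


def max_ping_payload(mtu, ipv6=False):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - ICMP_HEADER


# Standard Ethernet MTU, as reported above.
print(max_ping_payload(1500))             # IPv4
print(max_ping_payload(1500, ipv6=True))  # IPv6
```

With a 1500-byte MTU the IPv4 value is 1472, so `ping -M do -s 1472 <host>` (Linux iputils) or `ping -f -l 1472 <host>` (Windows) should get replies, while a payload one byte larger should be refused if nothing on the path fragments.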

I suspect the problem might be at my end: as the output above shows, the failures occurred across multiple servers, and I doubt they are all broken. Also, today is much milder than recent days (it was over 40 °C yesterday) and I noticed network transmission seems more stable. (That repeatedly failing upload eventually completed overnight, at 17:34:48 UTC, over 21 hours after task completion.)

So that brings back my original question on whether it's possible to implement some sort of resume functionality for potentially unreliable connections. Spending several hours on a task and then not being able to upload seems rather wasteful. (I've had a few tasks expire or be marked as 'failed'.)

jcoffland · 2024-12-18T11:43:23Z

This is possible and something I've had in mind for some time now. It will require changes to all of the Work Servers (WS) so they can save and continue partial uploads. Then the client itself must also support this. It would likely result in significant network bandwidth savings but it's going to take some time and there are currently other higher priority items.
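
A minimal sketch of what "save and continue partial uploads" could look like, purely as an illustration: this is not the actual WS or client API. `FakeWorkServer`, its `offset`/`append`/`finish` methods, and `upload_with_resume` are all hypothetical names, and the server is an in-memory stand-in that drops every Nth chunk to imitate the EOF failures. The final SHA-256 comparison is one possible answer to the transmission-integrity question raised earlier.

```python
import hashlib


class FakeWorkServer:
    """In-memory stand-in for a server that keeps partial uploads (hypothetical)."""

    def __init__(self, fail_after_chunks):
        self.partial = {}  # upload_id -> bytearray received so far
        self.fail_after_chunks = fail_after_chunks
        self.chunks_seen = 0

    def offset(self, upload_id):
        """Report how many bytes of this upload are already stored."""
        return len(self.partial.get(upload_id, b""))

    def append(self, upload_id, offset, chunk):
        """Accept a chunk only if it starts exactly where the stored data ends."""
        self.chunks_seen += 1
        if self.chunks_seen % self.fail_after_chunks == 0:
            raise ConnectionError("EOF")  # simulated dropped connection
        buf = self.partial.setdefault(upload_id, bytearray())
        if offset != len(buf):
            raise ValueError("offset mismatch")
        buf.extend(chunk)

    def finish(self, upload_id, sha256_hex):
        """Verify the assembled upload against the client's digest."""
        stored = self.partial.get(upload_id, b"")
        return hashlib.sha256(stored).hexdigest() == sha256_hex


def upload_with_resume(server, upload_id, data, chunk_size=4, max_attempts=50):
    """Send data in chunks, resuming from the server-reported offset on failure.

    Returns (verified, bytes_put_on_the_wire).
    """
    bytes_sent = 0
    for _ in range(max_attempts):
        pos = server.offset(upload_id)  # ask what the server already has
        try:
            while pos < len(data):
                chunk = data[pos:pos + chunk_size]
                bytes_sent += len(chunk)  # counted even if the chunk is lost
                server.append(upload_id, pos, chunk)
                pos += len(chunk)
            digest = hashlib.sha256(data).hexdigest()
            return server.finish(upload_id, digest), bytes_sent
        except ConnectionError:
            continue  # retry from the stored offset, not from 0%
    return False, bytes_sent


if __name__ == "__main__":
    payload = bytes(range(40))
    ws = FakeWorkServer(fail_after_chunks=3)
    print(upload_with_resume(ws, "wu", payload))
```

In this toy run every third chunk is dropped, yet the client only re-sends the dropped chunk rather than the whole file, and the digest check at the end confirms the reassembled data matches.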

Wedge009 · 2024-12-18T12:26:59Z

No worries, just good to know it's possible and part of the plan. Thanks.

Wedge009 · 2025-02-05T08:37:19Z

I can report in a separate issue if it's suitable, but I wonder if this is related to result uploads. If a host temporarily loses connection for whatever reason, I find that even a successful upload later on seems to result in a failure return from the server.

('~50 days ago' relates to the server upload problems I mentioned above in December 2024 - I really dislike relative time-stamps.)

And somewhat related to that (again, I can report separately if required), I find work-units are sometimes dumped too easily. Even after something as innocuous as a clean reboot, the client can complain on restart about not being able to find any results, so it dumps the task. Quite frustrating when progress was close to complete.

These aren't huge problems, but I find the wasted energy on dumped/failed/rejected/etc work quite disappointing - and I didn't find this to be an issue with the v7 client.

muziqaz · 2025-02-05T08:45:36Z

The Windows client has an issue with Windows stabbing it in the back every time the system is rebooted. So until the fix is sorted, we advise Windows users to pause their work before rebooting.
Linux has no issues there.

Wedge009 · 2025-02-05T09:18:45Z

I did notice it was usually on Windows... a shame it's something that has to be done manually - sometimes a reboot might not come at a good time, particularly if the user is on a recent Windows that reboots automatically all the time.

muziqaz · 2025-02-05T09:40:31Z

Unfortunately the issue is quite difficult to fix since it is more Windows rather than FAHClient

Wedge009 · 2025-02-05T09:48:20Z

I feel this is getting off-topic for the actual issue about uploads, but what do you mean by 'more (to do with) Windows'? Uncontrolled rebooting? Yeah, that's a Windows problem. But the Windows fah-client not resuming properly is another matter: as I mentioned above, that didn't seem to be a problem with the v7 client. I don't know what the architectural changes are between v7 and v8, but they must be pretty significant for this to be accepted as standard for Windows hosts.

muziqaz · 2025-02-05T10:20:43Z

It is not accepted as standard. Just the fix is a bit elusive

Wedge009 · 2025-02-08T01:23:32Z

I made the mistake of rebooting during an upload and forgetting to pause the current task... 🤦

muziqaz · 2025-02-08T07:54:18Z

Pausing and then rebooting while uploading the WU would not have saved it.
I know the bug is FAH's fault, but at some point there needs to be a bit of care and responsibility when participating in projects like FAH. :)

Wedge009 · 2025-02-08T08:12:59Z

The task with the interrupted upload would still be marked as failed, but the pause would have at least saved the other task. Fortunately progress was only 3.6%.

Most of my hosts are Linux-based, but I was in such a hurry to reboot the Windows host in this particular instance that by the time I remembered to check what FaH was doing it was too late. As I said, this was never an issue with the v7 client, so at the very least this sort of behaviour should be considered a regression.

muziqaz · 2025-02-08T09:04:50Z

Make no mistake, this is considered a bug, and a very serious one. This is in no way considered normal or acceptable behaviour from any software.

jcoffland changed the title ~~Is it possible to implement result upload resuming?~~ Resume partial downloads. Dec 18, 2024

jcoffland added the enhancement New feature or request label Dec 18, 2024
