App hangs indefinitely after a series of SQL timeouts #6
I love this bug report. Based on the stack trace, for which I thank you a thousand times, the hang is probably happening because the bulk insert process is gummed up, and that backs up through the streams. The bulk insert is reading from a DBReader that is consuming the stream that is being written to. I only have access to one SQL Server instance, which is on my local machine, so I'll have to get creative to reproduce it. I think your suggestion of a super low timeout might help. My wife is working third shift tomorrow night, so I'll have some free time after I get my little guy to sleep. Thanks again for the great feedback.
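For context, a minimal illustrative sketch (not the actual `Soddi.Services.BlockingStream` code; the type and names below are invented for the example) of how a bounded producer/consumer stream can hang once the consumer stops reading:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative only: a bounded pipe where the writer blocks once the buffer is full.
// If the reader (the bulk-copy side) stops draining it, the writer hangs forever.
class BoundedPipe
{
    private readonly BlockingCollection<byte[]> _chunks = new(boundedCapacity: 4);

    public void Write(byte[] chunk) => _chunks.Add(chunk);   // blocks when the buffer is full
    public byte[] Read() => _chunks.Take();                  // blocks when the buffer is empty
}

class Program
{
    static void Main()
    {
        var pipe = new BoundedPipe();

        // Producer: keeps parsing and writing chunks (think: the XML reader side).
        var producer = Task.Run(() =>
        {
            for (var i = 0; ; i++)
            {
                pipe.Write(new byte[1024]);
                Console.WriteLine($"wrote chunk {i}");
            }
        });

        // Consumer: reads a few chunks, then stops (think: SqlBulkCopy stuck after timeouts).
        var consumer = Task.Run(() =>
        {
            for (var i = 0; i < 2; i++) pipe.Read();
            Thread.Sleep(Timeout.Infinite); // never reads again
        });

        // The producer fills the buffer, then blocks in Write forever -> the app appears hung.
        Task.WaitAll(producer, consumer);
    }
}
```

If the bulk-copy side stops pulling data, for example while it is stuck handling timeouts, the parsing side eventually blocks in `Write` and never wakes up, which would match the two blocked threads described in the report.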
One small question - I noticed I had the timeout function as
Yes - I had let it run with retry waits of up to 8 minutes. In one of the iterations, I left it "hanging" overnight, and it never recovered. Note that in the specific state where it seems to lock up, the exponential backoff does not really apply, because the relevant code is not actually executing the bulk copy (so the Polly retry mechanism never kicks in). Also, to help with the repro: reducing the bulk copy timeout and increasing the batch size to a larger value is aggressive enough to trigger timeouts, and thereby simulate this issue, fairly quickly (see the sketch below).
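For anyone wanting to force this locally, a hedged sketch of the kind of settings that should provoke bulk-copy timeouts quickly (a plain `SqlBulkCopy`; the table name, connection string, and reader are placeholders, not Soddi's actual configuration):

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

static void ProvokeTimeouts(string connectionString, IDataReader reader)
{
    // Deliberately hostile settings to make bulk-copy timeouts fire quickly while testing.
    using var bulkCopy = new SqlBulkCopy(connectionString)
    {
        DestinationTableName = "dbo.PostHistory", // any large table from the dump
        BulkCopyTimeout = 1,                      // 1 second: big batches will time out
        BatchSize = 500_000,                      // oversized batches slow each round trip
        EnableStreaming = true,
    };

    bulkCopy.WriteToServer(reader);
}
```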
I adjusted the policy to be a bit more forgiving on failure:

```csharp
var p = Policy
    .Handle<SqlException>()
    .WaitAndRetry(
        100,
        retryAttempt => TimeSpan.FromMilliseconds(Math.Min(retryAttempt * 100, 100)),
        (exception, span) =>
        {
            bc.BatchSize = (int)(bc.BatchSize * .5);
            if (span.TotalSeconds > 1)
            {
                AnsiConsole.MarkupLine($"[bold]{fileName}[/][red]{exception.Message}[/]{Environment.NewLine}Retrying in [blue]{span.Humanize()}[/] with a batch size of [cyan]{bc.BatchSize}[/].");
            }
        });
```

Now the retries happen more quickly, but more importantly I've added some logic to cut the batch size in half for each exception. Looking at the tables you were seeing, the larger-content ones seem to be at fault, so hopefully sending progressively smaller batches until the timeouts stop will give it a better chance of success. And because we're trying to dial that in on the fly, I'm also not displaying the error unless we've failed 10 or more times. I'd still like to find the underlying cause of this hang, though. From your description there is certainly something nasty in the depths of the multithreading that can get caught. This did cause a new issue with the progress bar being reset on failure, which I'm tracking with #7.
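For reference, a policy like the one above would presumably wrap the bulk copy call along these lines (a sketch, not the exact call site in Soddi; `bc` and `dataReader` are assumed to be the bulk copy and its source reader):

```csharp
// Run the bulk copy under the retry policy; each SqlException halves bc.BatchSize
// (per the handler above) before the next attempt.
p.Execute(() => bc.WriteToServer(dataReader));
```

Note that a retry expressed this way re-invokes `WriteToServer` against a reader that has already advanced, which is exactly the flaw discussed in the next comment.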
I did some more experiments and I think I noticed a tragic flaw. I might be missing something, but I think that because of the streaming, the retry will miss records. If we are sending 50,000 records in batches of 10,000, and batches 1, 2, and 3 succeed but the fourth fails, the retry will start on the 5th batch because the reader has already been consumed up to that point. The retry might need to happen from the start of the operation rather than the middle of the copy.
dotnet/SqlClient#38 is probably what I'd need.
@phil-scott-78 thank you very much for looking into this. If retrying the entire file is an easier "mitigation" to put in place for now, I think it would be great from a usability point of view. No urgency, though; I was just trying to validate how well the retry mechanism worked...
@phil-scott-78 one suggestion: given the challenge of implementing retry correctly, and assuming a retry at a slightly higher level within the app is not trivial, if there were a way to not block the data import when tables are already present, the user could redo the specific tables that had issues rather than starting from scratch for the entire set of files. From a user's perspective, allowing a list of tables to operate on (thereby letting the user specify a "retry" indirectly) would be a nice-to-have mechanism. Again, this is just a thought 😄
Also, I have not tested your recent changes, but I'm curious what the behavior would be if one task failed - would it throw and cause the entire app to terminate? If so, it would be great if we could contain the failure, let that specific task (i.e. file) abort, and let the other tasks continue (and if any of them subsequently fail for timeouts or whatever, so be it!). Basically, the idea is to contain the impact of a SQL error to the relevant file and let the others go on. That behavior, coupled with a potential option (as described in the previous comment) to let the user specify a list of tables to work on, could allow more manual control over the tricky larger loads, which naturally have a higher chance of encountering problems.
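A minimal sketch of the containment pattern being suggested here (illustrative only; Soddi's actual task orchestration may look quite different, and `importFile` is a placeholder):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

static class ImportRunner
{
    // Run one import task per file; a SqlException in one file is logged and contained
    // so the remaining files keep loading.
    public static async Task ImportAllAsync(
        IEnumerable<string> files, Func<string, Task> importFile)
    {
        var tasks = files.Select(async file =>
        {
            try
            {
                await importFile(file); // e.g. bulk copy one XML file into its table
            }
            catch (SqlException ex)
            {
                Console.WriteLine($"{file} failed and was skipped: {ex.Message}");
            }
        });

        await Task.WhenAll(tasks); // other files continue even if one failed
    }
}
```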
I'll take a look at that idea. My other idea was to create a buffer of what I'm sending in and replaying it on failure before continuing on with the stream. That might actually be a decent solution for network hiccups. In the meantime, I think the main culprit was really, really large batches with the big rows that exist in PostHistory. I've drastically reduced the batch size for that import. If you get a chance, I'm curious whether things improve.
Thank you. And just to be clear, I am not hoping for an incremental restart. The expectation would be that the user runs a TRUNCATE TABLE on the specific table(s) whose load was interrupted earlier, and then the data load starts from scratch on the specified tables. I did test a day ago with your first iteration of reduced batch sizes, and it did seem to help a bit in the resource-constrained DB case. The worry is that for longer-running loads (like StackOverflow.com), which take several hours across the board, the chances of something going wrong (in the cloud) are relatively higher. In those cases, being able to truncate and restart the load for specific table(s) from scratch would be so much better than having to restart everything. Thanks again for thinking about these feedback items!
Hey @arvindshmicrosoft, I had a few ideas while walking the dog this weekend on how to get restarting to work, and so far I've been pretty happy with the results. I'm buffering the data before the bulk insert statements are sent. If I get an exception, I clear out any data from the server that was successfully written and then do the buffered insert again. It's not bulletproof - if more data than I buffered was written, for example, it'll still fail. But for a lot of common issues I think it might increase the odds of success quite a bit. If you want to check it out, I have a change up you can try. I'm still going to add the option you suggested about doing individual tables. I think you are right - for the Stack database it might be needed for a lot of people to succeed.
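A rough sketch of the buffered-replay idea as described (assumptions only: the actual change may track written rows differently, and the `Id`-range delete below is just one way to clear out partially written data):

```csharp
using System;
using System.Data;
using Microsoft.Data.SqlClient;

// Buffer one batch in memory, try to bulk copy it, and on failure delete whatever part
// of the batch did land before replaying the same buffered rows.
static void WriteBatchWithReplay(string connectionString, DataTable batch, int maxAttempts = 5)
{
    // Assumes the batch carries an "Id" column so partially written rows can be removed.
    var minId = Convert.ToInt64(batch.Rows[0]["Id"]);
    var maxId = Convert.ToInt64(batch.Rows[batch.Rows.Count - 1]["Id"]);

    for (var attempt = 1; ; attempt++)
    {
        try
        {
            using var bulkCopy = new SqlBulkCopy(connectionString)
            {
                DestinationTableName = batch.TableName,
                BulkCopyTimeout = 30,
            };
            bulkCopy.WriteToServer(batch);
            return;
        }
        catch (SqlException) when (attempt < maxAttempts)
        {
            // Clear out any rows from this batch that were committed before the failure,
            // then fall through and replay the whole buffered batch.
            // (Table name comes from our own code, not user input.)
            using var connection = new SqlConnection(connectionString);
            connection.Open();
            using var delete = new SqlCommand(
                $"DELETE FROM {batch.TableName} WHERE Id BETWEEN @min AND @max", connection);
            delete.Parameters.AddWithValue("@min", minId);
            delete.Parameters.AddWithValue("@max", maxId);
            delete.ExecuteNonQuery();
        }
    }
}
```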
Thank you, @phil-scott-78! One more suggestion to help with data-quality checks: when the final summary is output, can we compare the number of elements read from the XML against the number sent to the DB? Also, the "humanized" output in the summary is probably counter-productive; we may want to print the exact numbers there. In any case, I really appreciate your great work here!
The source XML files (Dec 2021 snapshot) have the following line counts. We can subtract 2 from each line count and then expect that to match up with the row count in the DB above.
Oh that's a great idea. I love it. I'll see what I can do |
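A hedged sketch of what such a check might look like - streaming through the XML to count `row` elements and comparing that with the destination table's row count (the method and output format here are assumptions, not the eventual Soddi implementation):

```csharp
using System;
using System.Xml;
using Microsoft.Data.SqlClient;

static void CompareCounts(string xmlPath, string connectionString, string tableName)
{
    // Count <row .../> elements as they stream past, without loading the whole file.
    long xmlRows = 0;
    using (var xml = XmlReader.Create(xmlPath))
    {
        while (xml.Read())
        {
            if (xml.NodeType == XmlNodeType.Element && xml.Name == "row")
                xmlRows++;
        }
    }

    using var connection = new SqlConnection(connectionString);
    connection.Open();
    // Table name comes from our own table list, not user input.
    using var count = new SqlCommand($"SELECT COUNT_BIG(*) FROM {tableName}", connection);
    var dbRows = (long)count.ExecuteScalar();

    // Print exact numbers rather than humanized ones so discrepancies are obvious.
    Console.WriteLine($"{tableName}: {xmlRows:N0} rows in XML, {dbRows:N0} rows in database");
}
```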
Thanks for adding retry support, and parallel processing! The "happy path" works well in the limited tests I did!
Keen to see the retry handling in action, I simulated a case with an under-powered database (I restricted the vCore allocation) while still using parallelism, hoping to trigger timeouts. As you will see below, under this condition of stress the app seems to hang after a while. I also tried skipping Tags to see if the issue still reproduced, and it does. While the issue happens even with PostTags being processed, I can consistently reproduce the hang with the following command-line arguments (which eliminate Tag processing from the set of variables):
```
import unix.stackexchange.com.7z --skipTags true -d "unix.stackexchange" -c "<REDACTED>"
```
I suppose another way to reproduce this issue is to use artificially low timeouts in the code for the bulk copy tasks.
After a minute or two, timeouts pile up and eventually put the app into a state where it just "hangs" (the console output is at the end of this issue). I wrestled with the problem and looked at your code for a couple of hours last night, but I seem to have run out of ideas; I hope you can look into this.
When the app hangs, I can confirm there is no activity on the SQL side. When I attach a debugger, it shows that there are basically two threads alive, and both are blocked.
The main thread seems to always be stuck here:
The call stack for the main thread shows:
The other thread seems stuck inside Soddi.Services.BlockingStream.Write:
The call stack for that thread is:
The other threads are all executing non-user code as far as I can see. Below is the console output; as I said, it just seems to hang indefinitely after those SQL timeout exceptions. I have tried this multiple times now, and it reproduces consistently for different archives. I tried adding some missing `using` directives to ensure timely auto-disposal of objects where needed, but that did not seem to make any difference. I've also left it as-is for several hours; that did not make any difference either. Again, at that point I do not see any connections to SQL at all. Different executions will show one or more tasks getting stuck in this way; most of the time it is Posts / PostHistory, but I have also noticed Users and other tasks sometimes getting "stuck" like this.