App hangs indefinitely after a series of SQL timeouts #6

Open
arvindshmicrosoft opened this issue Jan 25, 2022 · 15 comments

@arvindshmicrosoft
Contributor

arvindshmicrosoft commented Jan 25, 2022

Thanks for adding retry support, and parallel processing! The "happy path" works well in the limited tests I did!

Keen to see the retry handling in action, I simulated a stressed environment with an under-powered database (restricted vCore allocation) while still using parallelism, hoping to provoke timeouts. As you will see below, under that stress the app seems to hang after a while. I also tried skipping Tags to see whether the issue still reproduced, and it does. While the issue happens even with PostTags being processed, I can consistently reproduce the hang with the following command-line arguments (which remove the Tag processing from the set of variables).

import unix.stackexchange.com.7z --skipTags true -d "unix.stackexchange" -c "<REDACTED>"

I suppose another way to reproduce this issue is to use artificially low timeouts in the code for the bulk copy tasks.
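
For illustration (just a sketch, assuming the import uses SqlBulkCopy from Microsoft.Data.SqlClient; I have not checked exactly where Soddi configures this), something along these lines should make the timeouts fire almost immediately:

    using var bc = new SqlBulkCopy(connection)
    {
        DestinationTableName = "dbo.PostHistory",
        BatchSize = 10_000,
        BulkCopyTimeout = 1 // seconds; artificially low so "Execution Timeout Expired" fires quickly
    };
    bc.WriteToServer(reader); // reader = whatever IDataReader the import streams from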

After a minute or two, timeouts pile up and eventually cause the app to go into a state where it just "hangs" (console output is at the end of this Issue). I wrestled with the issue and looked at your code for a couple of hours last night but seem to have run out of ideas; I hope you can look into this.

When the app hangs, I can confirm there is no activity on the SQL side. Attaching a debugger shows that essentially two threads are still alive, and both are blocked.

The main thread seems to always be stuck here:

==> Parallel.ForEach(_processor.GetFiles(), batch =>
        {
            var insertTasks = new List<Task>();
            foreach (var (fileName, stream, fileSize) in batch)

The call stack for the main thread shows:

>	Soddi.dll!Soddi.Tasks.SqlServer.InsertData.Go(System.IProgress<(string taskId, string message, double weight, double maxValue)> progress) Line 32	C#
 	Soddi.dll!Soddi.ImportHandler.ExecuteAsync.AnonymousMethod__1(Spectre.Console.ProgressContext ctx) Line 156	C#
 	[External Code]	
 	Soddi.dll!Soddi.ImportHandler.ExecuteAsync(Spectre.Console.Cli.CommandContext context, Soddi.ImportOptions request) Line 141	C#
 	[External Code]	
 	Soddi.dll!Program.<Main>$(string[] args) Line 43	C#

The other thread seems stuck inside Soddi.Services.BlockingStream.Write:

    public override void Write(byte[] buffer, int offset, int count)
    {
        ValidateBufferArgs(buffer, offset, count);

        var newBuf = new byte[count];
        Array.Copy(buffer, offset, newBuf, 0, count);
===>        _blocks.Add(newBuf);
        TotalBytesWritten += count;
        WriteCount++;
    }

The call stack for that thread is:

 	[External Code]	
>	Soddi.dll!Soddi.Services.BlockingStream.Write(byte[] buffer, int offset, int count) Line 85	C#
 	[External Code]	
 	Soddi.dll!Soddi.Tasks.SqlServer.InsertData.Go.AnonymousMethod__1() Line 41	C#
 	[External Code]	
 	[Waiting on Async Operation, double-click or press enter to view Async Call Stacks]	
 	Soddi.dll!Soddi.Tasks.SqlServer.InsertData.Go.AnonymousMethod__0(System.Collections.Generic.IEnumerable<(string fileName, System.IO.Stream stream, long size)> batch) Line 100	C#

The other threads are all executing non-user code as far as I can see. Below is the console output; as I said, it just seems to hang indefinitely after those SQL timeout exceptions. I have tried this multiple times now, and it reproduces consistently for different archives. I tried adding some missing using directives to ensure timely disposal of objects where needed, but that did not seem to make any difference. I have also left it as-is for several hours, and that made no difference either; again, at that point I do not see any connections to SQL at all. Different executions show one or more tasks getting stuck in this way. Most of the time it is Posts / PostHistory, but I have also seen Users and other tasks sometimes getting "stuck" like this.

The statement has been terminated.. Retrying in 2 seconds.━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.. Retrying in 2 seconds.
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
The statement has been terminated.. Retrying in 2 seconds.━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
The statement has been terminated.. Retrying in 4 seconds.━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.. Retrying in 4 seconds.
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
The statement has been terminated.. Retrying in 4 seconds.━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
The statement has been terminated.. Retrying in 8 seconds.━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.. Retrying in 8 seconds.
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
The statement has been terminated.. Retrying in 16 seconds.━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
The statement has been terminated.. Retrying in 16 seconds.━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
Error: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user.               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
The statement has been terminated.. Retrying in 32 seconds.━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00

✔ Creating database                        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ Creating objects                         ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ Adding constraints                       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ Inserting type values                    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ tags.xml                                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ postlinks.xml                            ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ badges.xml                               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ votes.xml                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
⣽ posthistory.xml (30k rows)               ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   1% 00:03:19
✔ comments.xml                             ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ posts.xml                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
✔ users.xml                                ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 00:00:00
@phil-scott-78
Owner

I love this bug report.

Based on the stack trace, for which I thank you a thousand times, the hang is probably happening because the bulk insert process gets gummed up and the backpressure trickles through the streams. The bulk insert reads from a DbReader that consumes the stream being written to by the BlockingStream, and as it reads it drains the buffer. With no free space in that buffer, the decompression thread just sits and waits.

I could do a TryAdd here with a timeout and at least just fail rather than have a hang, but obviously getting to the bottom of why the insert process stopped reading and left the buffer full is the best bug fix.
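
Roughly what I have in mind (a sketch only, assuming _blocks is the bounded BlockingCollection<byte[]> behind BlockingStream, and with an arbitrary timeout):

    public override void Write(byte[] buffer, int offset, int count)
    {
        ValidateBufferArgs(buffer, offset, count);

        var newBuf = new byte[count];
        Array.Copy(buffer, offset, newBuf, 0, count);

        // Fail fast instead of blocking forever when the bulk copy side stops draining the buffer.
        if (!_blocks.TryAdd(newBuf, TimeSpan.FromMinutes(2)))
        {
            throw new TimeoutException("Bulk copy consumer stopped reading; giving up instead of hanging.");
        }

        TotalBytesWritten += count;
        WriteCount++;
    }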

I only have access to one SQL Server instance which is on my local machine, so I'll have to get creative to reproduce it. I think your suggestion of a super low timeout might help. My wife is working third shift tomorrow night so I'll have some free time after I get my little guy to sleep.

Thanks again for the great feedback.

@phil-scott-78
Owner

One small question - I noticed I had the timeout function as Pow(2, retryCount), which means that after the 32s wait the next one would be over a minute. I'm assuming you gave it longer than that to get its bearings, but I wanted to confirm before digging too deep. To be honest, I copied that policy over from another app without a UI, where long retry waits weren't a huge deal after that many failures, but I'll push some new code that caps it at 5s or so.

@arvindshmicrosoft
Contributor Author

Yes - I let it run long enough to reach 8-minute retry waits. In one of the iterations I left it "hanging" overnight, and it never recovered. Note that in the specific state it locks up in, the exponential backoff is irrelevant: the relevant code is not actually executing a bulk copy, so the Polly retry mechanism never comes into play.

Also, to help with the repro: reducing the bulk copy timeout and increasing the batch size should be aggressive enough to trigger timeouts, and thereby simulate this issue, fairly quickly.

@phil-scott-78
Owner

phil-scott-78 commented Jan 26, 2022

I adjusted the policy to be a bit more forgiving on failure:

        var p = Policy
            .Handle<SqlException>()
            .WaitAndRetry(
                100,
                retryAttempt => TimeSpan.FromMilliseconds(Math.Min(retryAttempt * 100, 100)),
                (exception, span) =>
                {
                    bc.BatchSize = (int)(bc.BatchSize * .5);
                    if (span.TotalSeconds > 1)
                    {
                        AnsiConsole.MarkupLine($"[bold]{fileName}[/][red]{exception.Message}[/]{Environment.NewLine}Retrying in [blue]{span.Humanize()}[/] with a batch size of [cyan]{bc.BatchSize}[/].");
                    }
                });

Now the retries happen more quickly, but more importantly I've added logic to cut the batch size in half on each exception. Looking at the tables you were seeing fail, the ones with larger content seem to be at fault, so hopefully sending progressively smaller batches until the timeouts stop will give it a better chance of success.

And because we're trying to dial that in on the fly, I'm also not displaying the error unless we've failed 10 or more times.

I'd still like to find the underlying cause of this hang though. From your description there is certainly something nasty in the depths of the multithreading that can get caught.

This did cause a new issue with the progress bar being reset on failure, which I'm tracking in #7.

@phil-scott-78
Owner

I did some more experiments and I think I noticed a tragic flaw. I might be missing something, but I think because of the streaming the retry will miss records.

If we are sending 50,000 records in batches of 10,000, and batches 1, 2, and 3 succeed but the fourth fails, the retry will start on the fifth batch, because the DbReader that I'm streaming from has already pushed out the fourth batch and has no idea anything failed, nor any way to go back and replay those rows.

Theoretically I could create a DbReader that acts as a buffer and only clears once I'm assured the bulk copy has succeeded in copying the records in the buffer, but at a glance I'm not sure SqlBulkCopy exposes enough information to let me know when that happens.
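
The closest hook I know of is NotifyAfter/SqlRowsCopied, which only gives a running total of rows sent (sketch, not Soddi's actual code):

    bc.NotifyAfter = bc.BatchSize;
    bc.SqlRowsCopied += (_, e) =>
    {
        // e.RowsCopied is a running total of rows sent so far. There is no callback that says
        // "batch N failed, replay rows X..Y", so the streaming reader can't be rewound to retry it.
        Console.WriteLine($"{e.RowsCopied} rows copied so far");
    };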

The retry might need to happen from the start of the operation rather than the middle of the copy.

@phil-scott-78
Owner

dotnet/SqlClient#38 is probably what I'd need from SqlBulkCopy to enable the retry logic.

@arvindshmicrosoft
Contributor Author

@phil-scott-78 thank you very much for looking into this. If retrying the entire file is an easier "mitigation" to put in place for now, I think it would be great from a usability point of view. No urgency though; I was just trying to validate how well the retry mechanism worked...

@arvindshmicrosoft
Contributor Author

@phil-scott-78 one suggestion, given the challenge of implementing retry correctly (and assuming a retry at a slightly higher level within the app is not trivial): if there were a way to not block the data import when tables are already present, the user could redo just the specific tables that had issues rather than starting from scratch with the entire set of files. From a user perspective, allowing a list of tables to operate on (thereby letting the user specify a "retry" manually, in an indirect way) would be a nice-to-have mechanism. Again, this is just a thought 😄

@arvindshmicrosoft
Contributor Author

Also, I have not tested your recent changes, but I am curious what the behavior would be if one task failed: would it throw and cause the entire app to terminate? If so, it would be great if we could contain the failure, let that specific task (i.e. file) abort, and let the other tasks continue (and if any of them subsequently fail on timeouts or whatever, so be it!). Basically, the idea is to contain the impact of a SQL error to the relevant file and let the others carry on. That behavior, coupled with the option described in the previous comment to let the user specify a list of tables to work on, would allow more manual control over the tricky larger loads, which naturally have a higher chance of hitting problems.

@phil-scott-78
Owner

I'll take a look at that idea. If it weren't for the PostTags table, which I generate simultaneously with the Posts insert, it would be a lot easier. Getting things back into a clean state after a failure might be tricky. I'm making some big assumptions that a table doesn't exist, so that I don't need to do any checking for a proper schema or valid data. To restart midstream, I think I'll need to add quite a bit of logic to make sure whatever I'm restarting is in a good state for it.

My other idea was to create a buffer of what I'm sending in and replay it on failure before continuing on with the stream. That might actually be a decent solution for network hiccups.

In the meantime, I think the main culprit was really, really large batches combined with the big rows in PostHistory. I've drastically reduced the batch size for that import. If you get a chance, I'm curious whether things improve.

@arvindshmicrosoft
Contributor Author

Thank you. And just to be clear, I am not hoping for an incremental restart. The expectation would be that the user runs a TRUNCATE TABLE on the specific table(s) which got interrupted earlier, and then the data load starts (on the specified tables) from scratch.

I did test a day ago with your first iteration of reduced batch sizes, and it did seem to help a bit in the resource-constrained DB case. The worry is that for longer-running loads (like StackOverflow.com), which take several hours across the board, the chances of something going wrong (in the cloud) are relatively higher. In those cases, being able to truncate and restart the load for just the affected table(s) would be so much better than having to restart everything. Thanks again for thinking about these feedback items!

@phil-scott-78
Owner

Hey @arvindshmicrosoft, I had a few ideas on how to get restarting to work while walking the dog this weekend, and so far I've been pretty happy with the results. I'm buffering the data before the bulk insert statements are sent. If I get an exception, I clear out any data from the server that was successfully written and then run the buffered insert again. It's not bulletproof - if more data than I buffered was written, for example, it'll still fail - but for a lot of common issues I think it might increase the odds of success quite a bit.
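
The rough shape of the idea, with made-up helper names (the actual buffered-insert branch differs in the details):

    var buffer = new List<object[]>(batchSize);
    while (reader.Read())
    {
        buffer.Add(CopyCurrentRow(reader));   // hypothetical helper: snapshot the current row's values
        if (buffer.Count < batchSize) { continue; }

        retryPolicy.Execute(() =>
        {
            // Hypothetical helper: remove rows from this buffer that a partially failed
            // attempt already landed in the table, so the replay doesn't duplicate them.
            DeleteRowsAlreadyWritten(buffer);
            bulkCopy.WriteToServer(ToDataReader(buffer)); // hypothetical: expose the buffer as an IDataReader
        });

        buffer.Clear();
    }
    // (a final partial buffer would be flushed the same way after the loop)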

If you want to check it out, I have a branch named buffered-insert with the work in progress. I'd still like to reduce some allocations for the buffer before merging it into main, and I also want to try to break it in more creative ways. I only have access to a SQL Server on my local home machine, so I'm a bit limited in how I can break it right now; if you can kick the tires I'd appreciate it!

[Attachment: Animation]

I'm still going to add the option you suggested for doing individual tables. I think you are right: for the StackOverflow.com database it might be needed for a lot of people to succeed.

@arvindshmicrosoft
Contributor Author

Thank you, @phil-scott-78!
I ran some tests today against the resource-constrained database (0.5 vCores) and threw in a few manual failovers during the data ingestion. While I did not see the hanging behavior, 500 rows appear to be missing from each of the Posts and PostHistory tables (at least). The app did not return any error or warning about this potential data loss; it would be useful to let the user know.

One more suggestion to help with data quality checks: when the final summary is output, can we compare the number of elements read from the XML against the number of rows sent to the DB? Also, the "humanized" output in the summary is probably counter-productive there; we may want to print the exact numbers instead.
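
Something along these lines (purely a hypothetical sketch; the counter and names are made up) would be enough to flag the discrepancy:

    // rowsReadFromXml: hypothetical counter incremented for each row element parsed from the XML
    using var cmd = new SqlCommand($"SELECT COUNT_BIG(*) FROM dbo.[{tableName}]", connection);
    var rowsInDb = (long)cmd.ExecuteScalar();
    if (rowsInDb != rowsReadFromXml)
    {
        AnsiConsole.MarkupLine(
            $"[yellow]{tableName}: read {rowsReadFromXml:N0} rows from XML, but the table has {rowsInDb:N0}[/]");
    }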

In any case, I really appreciate your great work here!

table_name rows
Votes 2905576
PostHistory 1684540
Comments 869426
Badges 859680
Posts 526593
Users 455763
PostLinks 71837
Tags 2754
PostHistoryTypes 37
VoteTypes 15
PostTypes 8
LinkTypes 2
PostTags 0

The source XML files (Dec 2021 snapshot) have the following line counts. Deducting 2 from each line count should match the row counts from the DB above: for example, Badges (859682 - 2 = 859680) and Votes (2905578 - 2 = 2905576) match exactly, while Posts (527095 - 2 = 527093 vs. 526593) and PostHistory (1685042 - 2 = 1685040 vs. 1684540) are each 500 rows short.

$ wc -l /d/temp/*.xml
    859682 /d/temp/Badges.xml
    869428 /d/temp/Comments.xml
   1685042 /d/temp/PostHistory.xml
     71839 /d/temp/PostLinks.xml
    527095 /d/temp/Posts.xml
      2756 /d/temp/Tags.xml
    455765 /d/temp/Users.xml
   2905578 /d/temp/Votes.xml

@phil-scott-78
Owner

Oh, that's a great idea. I love it. I'll see what I can do.

@phil-scott-78
Owner

Added a quick row count check.
[Screenshot: row count check output]

Next step is trying to reproduce the missing data. 500 is a pretty specific and helpful number for those two tables - that's the batch size I'm sending in with SqlBulkCopy.
