Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test run hangs #10394

Closed
jeffkl opened this issue Oct 18, 2024 · 6 comments
Closed

Test run hangs #10394

jeffkl opened this issue Oct 18, 2024 · 6 comments

Comments

@jeffkl
Copy link

jeffkl commented Oct 18, 2024

The NuGet.Client repo has a CI that occasionally experiences a test hang. I've gotten diagnostic logs and am not sure how to proceed.

The main test host log shows this for 12 minutes:

TpTrace Verbose: 0 : 3760, 13, 2024/10/18, 19:56:21.066, 8745895652, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:51210 localEndPoint: 127.0.0.1:51209 after 1004 ms
...
TpTrace Verbose: 0 : 3760, 13, 2024/10/18, 20:08:39.866, 16133844229, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: 127.0.0.1:51210 localEndPoint: 127.0.0.1:51209 after 1015 ms

And one of the hosts shows this for 7minutes:

TpTrace Verbose: 0 : 4508, 9, 2024/10/18, 19:55:39.666, 8331899048, testhost.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: [::ffff:127.0.0.1]:51209 localEndPoint: [::ffff:127.0.0.1]:51210
...
TpTrace Verbose: 0 : 4508, 9, 2024/10/18, 20:08:40.445, 16139625412, testhost.dll, TcpClientExtensions.MessageLoopAsync: Polling on remoteEndPoint: [::ffff:127.0.0.1]:51209 localEndPoint: [::ffff:127.0.0.1]:51210

All of the other hosts show them gracefully exiting:

TpTrace Information: 0 : 10988, 1, 2024/10/18, 19:55:05.806, 7993301964, testhost.dll, Testhost process exiting.
TpTrace Information: 0 : 10988, 7, 2024/10/18, 19:55:05.806, 7993303632, testhost.dll, SocketClient.PrivateStop: Stop communication from server endpoint: 127.0.0.1:050821, error:
TpTrace Verbose: 0 : 10988, 7, 2024/10/18, 19:55:05.806, 7993305167, testhost.dll, LengthPrefixCommunicationChannel.Dispose: Dispose reader and writer.
TpTrace Verbose: 0 : 10988, 7, 2024/10/18, 19:55:05.806, 7993305526, testhost.dll, TcpClientExtensions.MessageLoopAsync: exiting MessageLoopAsync remoteEndPoint: [::ffff:127.0.0.1]:50821 localEndPoint: [::ffff:127.0.0.1]:50822

Eventually Azure DevOps times out the test run after 15 total minutes.

Logs: vstest.diag.zip

Failed build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=847668&view=logs&s=feb4a670-8d1a-579f-6453-0b5a06576f62

Any advice on how to investigate this further would be greatly appreciated.

Here is our RunSettings: https://github.com/NuGet/NuGet.Client/blob/dev/build/xunit.runsettings

We're currently using .NET SDK 8.0.3xx, maybe I should try 8.0.4xx?

@nohwnd
Copy link
Member

nohwnd commented Oct 21, 2024

Hello,
those entries mean that vstest.console is waiting for testhost to finish, and testhost is waiting for a test to finish. In this case NuGet.XPlat.FuncTest.ListPackageTests.ListPackage_WithPrivateHttpSourceCredentialServiceIsInvokedAsNeeded_Succeeds, so probably that test is blocking on server, and won't finish.

You can add --blame-hang-timeout 5min to your run to collect hang dump, or you can do the same using these runsettings. Please notice that the settings go OUTSIDE of the RunConfiguration section.:

<RunSettings>
  <RunConfiguration>
   <!-- ... -->
  </RunConfiguration>
  <LoggerRunSettings>
    <Loggers>
      <Logger friendlyName="blame" enabled="True" />
    </Loggers>
  </LoggerRunSettings>
  <DataCollectionRunSettings>
    <DataCollectors>
      <DataCollector friendlyName="blame" enabled="True">
        <Configuration>
          <CollectDumpOnTestSessionHang TestTimeout="5min" HangDumpType="Full" />
        </Configuration>
      </DataCollector>
    </DataCollectors>
  </DataCollectionRunSettings>
</RunSettings>

@jeffkl
Copy link
Author

jeffkl commented Oct 21, 2024

@nohwnd thank you. It was not obvious to me that the testhost was waiting for ListPackage_WithPrivateHttpSourceCredentialServiceIsInvokedAsNeeded_Succeeds. Can you let me know you figured that out so I can look for that in the future?

We used to have blame enabled but I found it was only capturing dumps for the main vstest.console.exe process and not the out-of-proc test host. Am I mistaken? Does blame capture dumps from whatever host was actually running the test?

@nohwnd
Copy link
Member

nohwnd commented Oct 23, 2024

Blame should not capture dumps of vstest.console. vstest.console will start datacollector and testhost, and datacollector will observe testhost. If you are seeing crash dumps from vstest.console then you are either using a dumper that is not ours. Or some other solution, e.g. local dumps + custom data collector to attach them to test run.

The test I've figured out by looking at the log and checking how many started and stopped tests there are and then doing some light parsing.

cat Downloads\vstest.diag\vstest.diag.host.24-10-18_19-55-21_22032_15.log | 
% { $s = @{ start = @() ; end = @() } } { if ($_ -like "*RecordStart*") { $s.start += ($_ -split "RecordStart: starting test: ")[1] } ; if ($_ -like "*RecordEnd*") { $s.end += (($_ -split "RecordEnd: test: ")[1] -replace " execution completed", "") }} { $s }

$r.start | where { $_ -notin $s.end }

@jeffkl
Copy link
Author

jeffkl commented Oct 23, 2024

Okay that's very helpful, thanks @nohwnd! I'm working on enabling the blame hang dump collector at NuGet/NuGet.Client#6115 and its already reported on a hanging test.

Last question: The blame collector attaches the memory dumps to the test run and I can download them. Is the only way to get symbols via uploading the built artifacts on the CI machine? I tried building the same sources locally but VS would not load the symbols.

@nohwnd
Copy link
Member

nohwnd commented Oct 23, 2024

There is checkbox in VS Debugging options that says "require symbols to exactly match the source", I normally uncheck this. I don't know how you can improve your chances of seeing the whole stack. E.g. by enabling embedded portable symbols or something like that. Googling says that you can call lml after analyzing to dump to see which symbols the app tried to load.

@jeffkl
Copy link
Author

jeffkl commented Oct 23, 2024

Okay thanks for the pointers!

@jeffkl jeffkl closed this as not planned Won't fix, can't repro, duplicate, stale Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants