Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: tests on Windows can sporadically fail #465

Open
lread opened this issue Jun 23, 2022 · 8 comments
Open

ci: tests on Windows can sporadically fail #465

lread opened this issue Jun 23, 2022 · 8 comments

Comments

@lread
Copy link
Collaborator

lread commented Jun 23, 2022

Currently

It is not unusual for jobs on Windows to fail on GitHub Actions.
A re-run of any failed jobs usually succeeds.

Impact

It is a bit annoying to have to rerun these sporadic failed jobs.

Typical Symptom

The typical symptom is a SocketTimeoutException exception, for example:

Testing etaoin.api-test
Testing with browsers: [:firefox]
Error: Exception in thread "main" java.net.SocketTimeoutException: Read timed out
	at java.base/java.net.SocketInputStream.socketRead0(Native Method)
	at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
	at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
	at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
	at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
	at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:280)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
	at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
	at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:157)
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
	at clj_http.core$request.invokeStatic(core.clj:621)
	at clj_http.core$request.invoke(core.clj:546)
	at clj_http.core$request.invokeStatic(core.clj:547)
	at clj_http.core$request.invoke(core.clj:546)
	at clojure.lang.Var.invoke(Var.java:381)
	at clj_http.client$wrap_request_timing$fn__5700.invoke(client.clj:1083)
	at clj_http.headers$wrap_header_map$fn__4561.invoke(headers.clj:152)
	at clj_http.client$wrap_query_params$fn__5580.invoke(client.clj:817)
	at clj_http.client$wrap_basic_auth$fn__5586.invoke(client.clj:840)
	at clj_http.client$wrap_oauth$fn__5591.invoke(client.clj:857)
	at clj_http.client$wrap_user_info$fn__5600.invoke(client.clj:877)
	at clj_http.client$wrap_url$fn__5682.invoke(client.clj:1035)
	at clj_http.client$wrap_decompression$fn__5327.invoke(client.clj:416)
	at clj_http.client$wrap_input_coercion$fn__5492.invoke(client.clj:630)
	at clj_http.client$wrap_additional_header_parsing$fn__5517.invoke(client.clj:685)
	at clj_http.client$wrap_output_coercion$fn__5479.invoke(client.clj:574)
	at clj_http.client$wrap_exceptions$fn__5274.invoke(client.clj:250)
	at clj_http.client$wrap_accept$fn__5532.invoke(client.clj:728)
	at clj_http.client$wrap_accept_encoding$fn__5539.invoke(client.clj:750)
	at clj_http.client$wrap_content_type$fn__5526.invoke(client.clj:711)
	at clj_http.client$wrap_form_params$fn__5647.invoke(client.clj:959)
	at clj_http.client$wrap_nested_params$fn__5668.invoke(client.clj:993)
	at clj_http.client$wrap_flatten_nested_params$fn__5677.invoke(client.clj:1017)
	at clj_http.client$wrap_method$fn__5605.invoke(client.clj:893)
	at clj_http.cookies$wrap_cookies$fn__3434.invoke(cookies.clj:128)
	at clj_http.links$wrap_links$fn__4955.invoke(links.clj:63)
	at clj_http.client$wrap_unknown_host$fn__5685.invoke(client.clj:1046)
	at etaoin.impl.client$call.invokeStatic(client.cljc:101)
	at etaoin.impl.client$call.invoke(client.cljc:77)
	at etaoin.api$execute.invokeStatic(api.clj:228)
	at etaoin.api$execute.invoke(api.clj:202)
	at etaoin.api$create_session.invokeStatic(api.clj:255)
	at etaoin.api$create_session.doInvoke(api.clj:245)
	at clojure.lang.RestFn.invoke(RestFn.java:423)
	at etaoin.api$_connect_driver.invokeStatic(api.clj:3569)
	at etaoin.api$_connect_driver.doInvoke(api.clj:3484)
	at clojure.lang.RestFn.invoke(RestFn.java:423)
	at etaoin.api$boot_driver.invokeStatic(api.clj:3609)
	at etaoin.api$boot_driver.invoke(api.clj:3593)
	at etaoin.api_test$fixture_browsers$fn__8111.invoke(api_test.clj:56)
	at etaoin.api_test$fixture_browsers.invokeStatic(api_test.clj:56)
	at etaoin.api_test$fixture_browsers.invoke(api_test.clj:53)
	at clojure.test$compose_fixtures$fn__9203$fn__9204.invoke(test.clj:693)
	at clojure.test$default_fixture.invokeStatic(test.clj:686)
	at clojure.test$default_fixture.invoke(test.clj:682)
	at clojure.test$compose_fixtures$fn__9203.invoke(test.clj:693)
	at clojure.test$test_vars$fn__9235.invoke(test.clj:734)
	at etaoin.api_test$report_browsers.invokeStatic(api_test.clj:70)
	at etaoin.api_test$report_browsers.invoke(api_test.clj:68)
	at clojure.test$compose_fixtures$fn__9203$fn__9204.invoke(test.clj:693)
	at clojure.test$default_fixture.invokeStatic(test.clj:686)
	at clojure.test$default_fixture.invoke(test.clj:682)
	at clojure.test$compose_fixtures$fn__9203.invoke(test.clj:693)
	at clojure.test$test_vars.invokeStatic(test.clj:730)
	at clojure.test$test_all_vars.invokeStatic(test.clj:736)
	at clojure.test$test_ns.invokeStatic(test.clj:757)
	at clojure.test$test_ns.invoke(test.clj:742)
	at clojure.core$map$fn__5587.invoke(core.clj:2747)
	at clojure.lang.LazySeq.sval(LazySeq.java:40)
	at clojure.lang.LazySeq.seq(LazySeq.java:49)
	at clojure.lang.Cons.next(Cons.java:39)
	at clojure.lang.RT.boundedLength(RT.java:1785)
	at clojure.lang.RestFn.applyTo(RestFn.java:130)
	at clojure.core$apply.invokeStatic(core.clj:659)
	at clojure.test$run_tests.invokeStatic(test.clj:767)
	at clojure.test$run_tests.doInvoke(test.clj:767)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.core$apply.invokeStatic(core.clj:657)
	at clojure.core$apply.invoke(core.clj:652)
	at cognitect.test_runner$test.invokeStatic(test_runner.clj:74)
	at cognitect.test_runner$test.invoke(test_runner.clj:62)
	at cognitect.test_runner$_main.invokeStatic(test_runner.clj:127)
	at cognitect.test_runner$_main.doInvoke(test_runner.clj:115)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.lang.Var.applyTo(Var.java:702)
	at clojure.core$apply.invokeStatic(core.clj:657)
	at clojure.main$main_opt.invokeStatic(main.clj:317)
	at clojure.main$main_opt.invoke(main.clj:313)
	at clojure.main$main.invokeStatic(main.clj:424)
	at clojure.main$main.doInvoke(main.clj:387)
	at clojure.lang.RestFn.applyTo(RestFn.java:137)
	at clojure.lang.Var.applyTo(Var.java:702)
	at clojure.main.main(main.java:37)

Thoughts

This is different from #464. The default socket timeout is 60 seconds.

I've not done much verification to try to reproduce this one on my Windows VM yet.

My wild stab guess is that the WebDriver process has maybe crashed, but I dunno yet.
Maybe we are asking the WebDriver to create a session before it is ready to do so.

We could try this idea from #464:

When Etaoin is the one who has launched the WebDriver process it could also check if the process actually still seems to be running (I don't think we currently do this?).

@lread
Copy link
Collaborator Author

lread commented Jun 27, 2022

It feels to me like maybe the Windows platform, for whatever reasons, is simply less stable.

Sometimes when GitHub Actions tries to launch an app, we'll get a -1073741502 exit code, example raw log:

2022-06-26T13:46:31.9821729Z TASK tools-versions done.
2022-06-26T13:46:32.3019582Z ##[group]Run bb test unit --platform=jvm
2022-06-26T13:46:32.3019965Z �[36;1mbb test unit --platform=jvm�[0m
2022-06-26T13:46:32.3078141Z shell: C:\Program Files\PowerShell\7\pwsh.EXE -command ". '{0}'"
2022-06-26T13:46:32.3078439Z env:
2022-06-26T13:46:32.3078785Z   JAVA_HOME: C:\hostedtoolcache\windows\Java_Temurin-Hotspot_jdk\11.0.15-10\x64
2022-06-26T13:46:32.3079128Z ##[endgroup]
2022-06-26T13:46:32.3253202Z ##[error]Process completed with exit code -1073741502.

This translates to NTSTATUS code 0xC0000142 https://stackoverflow.com/a/38246636 which means DLL initialization failed. I see this sometimes for other projects too.

This is sporadic so I could maybe have GitHub Actions retry a command if it returns this exit code, but I am not sure what I am working-around here. Windows? GitHub Actions Windows platform? Some issue with the binaries I am launching? I'll probably never know.

@lread
Copy link
Collaborator Author

lread commented Jun 30, 2022

It is odd that the socket timeout exception seems to only occur on Windows.

I have seen it when talking to:

  • chromedriver from the JVM
  • geckodriver (firefox) from the JVM and from babashka.

@lread
Copy link
Collaborator Author

lread commented Jul 2, 2022

A bit of a long shot, but maybe related?:
actions/runner-images#1187

Specifically this might help?
actions/runner-images#1187 (comment)

lread added a commit to lread/etaoin that referenced this issue Jul 2, 2022
lread added a commit that referenced this issue Jul 2, 2022
@lread
Copy link
Collaborator Author

lread commented Jul 2, 2022

I'll report on any future Windows SocketTimeoutException occurrences here.
If we don't see any for a while, will close this issue.

@lread
Copy link
Collaborator Author

lread commented Aug 26, 2022

While working on clj-http-lite I've noticed that process-launch->available-connection on Windows can sometimes be extremely slow. On ubuntu ~5s is normal, but on Windows ~10s and then sometimes a whopping 90s.

@lread
Copy link
Collaborator Author

lread commented Mar 11, 2023

As of this writing, we are still getting many sporadic failures on GitHub Actions Windows.

Currently our default *wait-timeout* is global for all platforms. Currently 7 seconds.
I'll think about ways to make this more configurable. Maybe allow for a fn here that can vary based on OS type, CI or whatever.

Or maybe just letting the user bind it to whatever is good enough.
Will see how I can apply a larger default for CI + Windows tests only.

@lread
Copy link
Collaborator Author

lread commented Apr 30, 2023

Note that a socket timeout exception can mean that the WebDriver process is hung/failing somehow. In these cases, it is valuable to see the WebDriver logs.

By default, we do not capture WebDriver logs, they get sent to /dev/null.

For our tests, it might be interesting to capture WebDriver logs but dump them only for failing tests.

It is probably interesting to enable trace/debug-level logging, although a con of doing so is that this could potentially impact WebDriver behaviour. I'm probably willing to take that risk.

If we go down this road, we also don't want to pollute a system with a bunch of WebDriver log files. So some strategy to either not do that, or easily clean up would be necessary.

@lread
Copy link
Collaborator Author

lread commented May 16, 2024

As of 2024-05-16 Windows is still very flaky. It is normal to have to rerun failing Windows jobs to get a successful CI run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant