
Logs from execs/services can get dropped/cutoff #7210

Open
sipsma opened this issue Apr 26, 2024 · 1 comment
Labels
kind/bug Something isn't working

Comments

sipsma (Contributor) commented Apr 26, 2024

Repro test:

func TestLogDrop(t *testing.T) {
	t.Parallel()

	var logs safeBuffer
	c, ctx := connect(t, dagger.WithLogOutput(&logs))

	// Run a container that prints 2000 lines: line i is the decimal string of i repeated 100 times.
	_, err := c.Container().
		From(alpineImage).
		WithEnvVariable("BUST", identity.NewID()).
		WithExec([]string{"sh", "-c", `i=0; while [ $i -lt 2000 ]; do echo ` + strings.Repeat("$i", 100) + `; i=$((i + 1)); done`}).
		Sync(ctx)
	require.NoError(t, err)
	require.NoError(t, c.Close())

	// Scan the captured log output, advancing through the expected lines in order.
	scanner := bufio.NewScanner(&logs)
	var i int
	expectedLine := strings.Repeat(strconv.Itoa(i), 100)
	for scanner.Scan() {
		line := scanner.Text()

		// TODO: just for debug
		t.Log(line)

		if strings.HasSuffix(line, expectedLine) {
			i++
			expectedLine = strings.Repeat(strconv.Itoa(i), 100)
		}
	}
	require.NoError(t, scanner.Err())
	require.Equal(t, 2000, i)
}

This never passes locally for me; it gets a few hundred expected lines in before one goes missing.
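
As one rough way to narrow this down (hypothetical, not part of the original repro), a companion test could count how many of the expected lines show up at all, ignoring order, to tell wholesale truncation apart from individual lines going missing in the middle. It reuses the same test-suite helpers (connect, safeBuffer, alpineImage) as the repro above; the test name and the logdrop-line- marker are made up:

func TestLogDropLineCount(t *testing.T) {
	t.Parallel()

	var logs safeBuffer
	c, ctx := connect(t, dagger.WithLogOutput(&logs))

	// Same shape as the repro, but each line carries a unique, easy-to-grep marker.
	_, err := c.Container().
		From(alpineImage).
		WithEnvVariable("BUST", identity.NewID()).
		WithExec([]string{"sh", "-c", `i=0; while [ $i -lt 2000 ]; do echo logdrop-line-$i; i=$((i + 1)); done`}).
		Sync(ctx)
	require.NoError(t, err)
	require.NoError(t, c.Close())

	// Count every distinct marker that made it into the captured logs, in any order.
	seen := map[string]bool{}
	scanner := bufio.NewScanner(&logs)
	for scanner.Scan() {
		for _, field := range strings.Fields(scanner.Text()) {
			if strings.HasPrefix(field, "logdrop-line-") {
				seen[field] = true
			}
		}
	}
	require.NoError(t, scanner.Err())
	t.Logf("found %d of 2000 expected lines", len(seen))
}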

I ran into this problem in practice while debugging engine issues with buildkit scheduler debug logs enabled: #7128

The only workaround was to run a non-nested dev engine, which made the full logs available via docker logs.


I wouldn't personally consider it a bug that we ever drop logs; having some limit somewhere in the telemetry pipeline makes perfect sense. However:

  1. I'm not actually sure a limit is what's happening here, as opposed to logs going missing due to an actual bug.
  2. Even if a limit is being hit, we seem to hit it far earlier than ideal. The repro above writes only 2000 lines of at most a few hundred bytes each (well under 1 MB of output in total), which isn't particularly extreme even if the lines are written rapidly.
    • The fact that docker logs handles the same output fine seems particularly telling. Obviously our telemetry pipeline does a lot more, but not being able to get all the logs even locally feels wrong.
  3. In the PR linked above, I also sometimes saw errors along the lines of OpenTelemetry error: grpc received message larger than max (approximately that, from memory), which does feel like a genuine bug (see the sketch after this list for the gRPC limits involved).
    • However, I don't see that error anywhere in the repro above, so I don't know what to make of it. It may be a red herring, or the error itself may sometimes get dropped in these cases.
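
For reference on point 3, that error text matches gRPC's default 4 MiB cap on received messages, which applies independently on each side of a connection; a single OTLP export carrying a large batch of telemetry can exceed it. Below is a minimal, hedged sketch of where those limits live, using the standard OTLP gRPC trace exporter purely as an example; this is not Dagger's actual telemetry wiring, and the endpoint and sizes are made up:

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// "received message larger than max" is raised by whichever side receives
	// the oversized message. For a large export request that is usually the
	// collector/receiver; a gRPC server raises its limit with
	// grpc.NewServer(grpc.MaxRecvMsgSize(16 * 1024 * 1024)).
	//
	// On the exporter (client) side, the cap on received responses can be
	// raised via a dial option:
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"), // hypothetical collector endpoint
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithDialOption(
			grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(16*1024*1024)),
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer exp.Shutdown(ctx)
}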

cc @vito

sipsma added the kind/bug label on Apr 26, 2024
jedevc (Member) commented May 1, 2024

Could this potentially be related to something like #7227?
