
[BUG] Race in batchStrategy flushing logs on serverless shutdown #32951

Open
hulkholden opened this issue Jan 14, 2025 · 0 comments

hulkholden commented Jan 14, 2025

I've been debugging an issue with a Python Cloud Run job on GCP where the logs fail to appear in Datadog even though I can see them in Cloud Logging.

I believe I've found a race in batch_strategy.go (more below).

Agent Environment
datadog/serverless-init:1.5.3 (not 100% sure which agent version this maps to)

Describe what happened:
The job runs every 5 minutes and is short-lived (several seconds of runtime) when there's no work to do. It prints a couple of summary lines before it exits. 90%+ of the time the logs don't appear in Datadog, even though I see them captured by Cloud Logging.

Describe what you expected:
100% of log lines to be captured.

Steps to reproduce the issue:
Run a short-lived Python app as a Cloud Run job (see below for a trivial example).

Additional environment details (Operating System, Cloud provider, etc):
GCP Cloud Run, python:3.12-slim-bookworm.

Example app:

import logging

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    # Configure a handler and level; without this, INFO messages are dropped.
    logging.basicConfig(level=logging.INFO)
    logger.info("hello")

Dockerfile:

FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim AS builder
ENV UV_LINK_MODE=copy
WORKDIR /app
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=uv.lock,target=uv.lock \
    --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
    uv sync --frozen --no-install-project --no-dev --no-install-workspace
ADD . /app
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --package credit-allocation

FROM python:3.12-slim-bookworm AS runtime

COPY --from=datadog/serverless-init /datadog-init /bin/datadog-init

ENV PYTHONFAULTHANDLER=1
ENV PYTHONHASHSEED=random
ENV PYTHONUNBUFFERED=1
ENV PATH="/app/.venv/bin:$PATH"

COPY --from=builder /app /app
WORKDIR /app/apps/foo

ARG SERVICE="foo"
ARG VERSION="unknown"

ENV DD_SERVICE=$SERVICE
ENV DD_VERSION=$VERSION
ENV DD_GIT_COMMIT_SHA=$VERSION

LABEL com.datadoghq.tags.service="$SERVICE"
LABEL com.datadoghq.tags.version="$VERSION"

ENTRYPOINT ["/bin/datadog-init"]

CMD [ "ddtrace-run", "python", "app.py" ]

Additional env vars:

DD_ENV=dev
DD_TRACE_ENABLED=True
DD_LOGS_ENABLED=True
DD_TRACE_SAMPLE_RATE=0.1
DD_LOG_LEVEL=DEBUG

Debugging

Here are the logs of interest:

INFO 2025-01-14T10:20:55.038304Z <my application logs of interest (2 lines)>
DEFAULT 2025-01-14T10:20:55.779608Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | INFO | Triggering a flush in the logs-agent
DEFAULT 2025-01-14T10:20:55.779635Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | DEBUG | Flush in the logs-agent done.
DEFAULT 2025-01-14T10:20:55.779643Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | DEBUG | Send messages for pipeline logs (msg_count:2, content_size=1101, avg_msg_size=550.50)

Here's what I think is happening (a distilled sketch of this sequence follows the list):

  1. My process logs and exits. serverless-init/main.go calls lastFlush, which in turn calls logsAgent.Flush().
  2. "Triggering a flush in the logs-agent" / "Flush in the logs-agent done." - this appears to be the logging from logsAgent.Flush().
  3. "Send messages for pipeline logs" - my log messages appear! I think this is triggered by writing to flushChan in Pipeline.Flush(). This causes the goroutine running in batchStrategy.Start() to call flushBuffer(), which logs "Send messages for pipeline logs".
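
To make the ordering concrete, here's a distilled, runnable sketch of the pattern. The names flushChan and the printed function names mirror the agent's code, but the program itself is illustrative, not the agent's implementation:

package main

import (
	"fmt"
	"time"
)

func main() {
	flushChan := make(chan struct{})

	// Consumer goroutine, analogous to batchStrategy.Start().
	go func() {
		for range flushChan {
			time.Sleep(10 * time.Millisecond) // simulate the goroutine being scheduled late
			fmt.Println("flushBuffer: sending batched messages")
		}
	}()

	// Analogous to Pipeline.Flush(): the send hands off the signal but
	// does not wait for flushBuffer() to run.
	flushChan <- struct{}{}
	fmt.Println("processor.Flush: final flush done, process exiting")
	// In the real agent the process can exit here, before the consumer
	// has sent its batch; this sleep exists only so the late message is visible.
	time.Sleep(50 * time.Millisecond)
}

Running this almost always prints "processor.Flush: final flush done" before "flushBuffer: sending batched messages", which matches the log ordering above.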

I believe the bug is here:

// Flush flushes synchronously the processor and sender managed by this pipeline.
func (p *Pipeline) Flush(ctx context.Context) {
	p.flushChan <- struct{}{}
	p.processor.Flush(ctx) // flush messages in the processor into the sender

	if p.serverless {
		// Wait for the logs sender to finish sending payloads to all destinations before allowing the flush to finish
		p.flushWg.Wait()
	}
}

Writing to flushChan triggers a flush, but there's no guarantee that the goroutine in batchStrategy.Start() completes its flush before the call to p.processor.Flush() starts: the channel send only hands off the signal; it doesn't wait for flushBuffer() to return.

That means batchStrategy can end up calling sendMessages() and enqueuing the batched messages after the pipeline has flushed for the final time, so they're never sent before the process exits.
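
One way to close the race would be to have the flush request carry a completion channel, so Flush() can wait for flushBuffer() to finish before flushing the processor. This is a sketch of the general pattern, not the agent's actual fix; flushRequest and its done channel are hypothetical:

package main

import "fmt"

// flushRequest is hypothetical: the agent's flushChan is a bare
// chan struct{}; this sketch changes it to carry an ack channel.
type flushRequest struct {
	done chan struct{} // closed by the consumer once the flush completes
}

func main() {
	flushChan := make(chan flushRequest)

	// Consumer, analogous to the batchStrategy.Start() goroutine.
	go func() {
		for req := range flushChan {
			fmt.Println("flushBuffer: batch sent")
			close(req.done) // the flush has actually completed
		}
	}()

	// Caller, analogous to Pipeline.Flush().
	req := flushRequest{done: make(chan struct{})}
	flushChan <- req
	<-req.done // guaranteed: batched messages are flushed before continuing
	fmt.Println("processor.Flush: no batched messages can arrive after this")
}

With this ordering, sendMessages() happens before processor.Flush() returns control to the caller, so nothing can be enqueued after the final flush.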
