
[BUG] Race in batchStrategy flushing logs on serverless shutdown #32951

Open
hulkholden opened this issue Jan 14, 2025 · 0 comments

hulkholden commented Jan 14, 2025

I've been debugging an issue with a Python Cloud Run job on GCP where the logs fail to appear in Datadog even though I can see them in Cloud Logging.

I believe I've found a race in batch_strategy.go (more below).

Agent Environment
datadog/serverless-init:1.5.3 (not 100% sure which agent version this maps to)

Describe what happened:
The job runs every 5 minutes and is short-lived (several seconds of runtime) when there's no work to do. It prints a couple of summary lines before it exits. 90%+ of the time the logs don't appear in Datadog, even though I see them captured by Cloud Logging.

Describe what you expected:
100% of log lines to be captured.

Steps to reproduce the issue:
Run a short-lived Python app as a Cloud Run job (see below for a trivial example).

Additional environment details (Operating System, Cloud provider, etc):
GCP Cloud Run, python:3.12-slim-bookworm.

Example app:

import logging

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    # Configure a handler and level; without this, INFO messages are dropped.
    logging.basicConfig(level=logging.INFO)
    logger.info("hello")

Dockerfile:

FROM ghcr.io/astral-sh/uv:python3.12-bookworm-slim AS builder
ENV UV_LINK_MODE=copy
WORKDIR /app
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=uv.lock,target=uv.lock \
    --mount=type=bind,source=pyproject.toml,target=pyproject.toml \
    uv sync --frozen --no-install-project --no-dev --no-install-workspace
ADD . /app
RUN --mount=type=cache,target=/root/.cache/uv \
    uv sync --frozen --no-dev --package credit-allocation

FROM python:3.12-slim-bookworm AS runtime

COPY --from=datadog/serverless-init /datadog-init /bin/datadog-init

ENV PYTHONFAULTHANDLER=1
ENV PYTHONHASHSEED=random
ENV PYTHONUNBUFFERED=1
ENV PATH="/app/.venv/bin:$PATH"

COPY --from=builder /app /app
WORKDIR /app/apps/foo

ARG SERVICE="foo"
ARG VERSION="unknown"

ENV DD_SERVICE=$SERVICE
ENV DD_VERSION=$VERSION
ENV DD_GIT_COMMIT_SHA=$VERSION

LABEL com.datadoghq.tags.service="$SERVICE"
LABEL com.datadoghq.tags.version="$VERSION"

ENTRYPOINT ["/bin/datadog-init"]

CMD [ "ddtrace-run", "python", "app.py" ]

Additional env vars:

DD_ENV=dev
DD_TRACE_ENABLED=True
DD_LOGS_ENABLED=True
DD_TRACE_SAMPLE_RATE=0.1
DD_LOG_LEVEL=DEBUG

Debugging

Here are the logs of interest:

INFO 2025-01-14T10:20:55.038304Z <my application logs of interest (2 lines)>
DEFAULT 2025-01-14T10:20:55.779608Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | INFO | Triggering a flush in the logs-agent
DEFAULT 2025-01-14T10:20:55.779635Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | DEBUG | Flush in the logs-agent done.
DEFAULT 2025-01-14T10:20:55.779643Z 2025-01-14 10:20:55 UTC | SERVERLESS_INIT | DEBUG | Send messages for pipeline logs (msg_count:2, content_size=1101, avg_msg_size=550.50)

Here's what I think is happening (a distilled sketch of this sequence follows the list):

  1. My process logs and exits. serverless-init/main.go calls lastFlush, which in turn calls logsAgent.Flush().
  2. "Triggering a flush in the logs-agent" / "Flush in the logs-agent done." - this appears to be the logging from logsAgent.Flush().
  3. "Send messages for pipeline logs" - my log messages appear! I think this is triggered by writing to flushChan in Pipeline.Flush(). This causes the goroutine running in batchStrategy.Start() to call flushBuffer(), which logs "Send messages for pipeline logs".
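
To make the ordering concrete, here's a distilled, runnable sketch of the pattern. The names flushChan and the printed function names mirror the agent's code, but the program itself is illustrative, not the agent's implementation:

package main

import (
	"fmt"
	"time"
)

func main() {
	flushChan := make(chan struct{})

	// Consumer goroutine, analogous to batchStrategy.Start().
	go func() {
		for range flushChan {
			time.Sleep(10 * time.Millisecond) // simulate the goroutine being scheduled late
			fmt.Println("flushBuffer: sending batched messages")
		}
	}()

	// Analogous to Pipeline.Flush(): the send hands off the signal but
	// does not wait for flushBuffer() to run.
	flushChan <- struct{}{}
	fmt.Println("processor.Flush: final flush done, process exiting")
	// In the real agent the process can exit here, before the consumer
	// has sent its batch; this sleep exists only so the late message is visible.
	time.Sleep(50 * time.Millisecond)
}

Running this almost always prints "processor.Flush: final flush done" before "flushBuffer: sending batched messages", which matches the log ordering above.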

I believe the bug is here:

// Flush flushes synchronously the processor and sender managed by this pipeline.
func (p *Pipeline) Flush(ctx context.Context) {
	p.flushChan <- struct{}{}
	p.processor.Flush(ctx) // flush messages in the processor into the sender

	if p.serverless {
		// Wait for the logs sender to finish sending payloads to all destinations before allowing the flush to finish
		p.flushWg.Wait()
	}
}

Writing to flushChan triggers a flush, but there's no guarantee that the goroutine in batchStrategy.Start() completes its flush before the call to p.processor.Flush() starts: the channel send only hands off the signal; it doesn't wait for flushBuffer() to return.

That means batchStrategy can end up calling sendMessages() and enqueuing the batched messages after the pipeline has flushed for the final time, so they're never sent before the process exits.
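
One way to close the race would be to have the flush request carry a completion channel, so Flush() can wait for flushBuffer() to finish before flushing the processor. This is a sketch of the general pattern, not the agent's actual fix; flushRequest and its done channel are hypothetical:

package main

import "fmt"

// flushRequest is hypothetical: the agent's flushChan is a bare
// chan struct{}; this sketch changes it to carry an ack channel.
type flushRequest struct {
	done chan struct{} // closed by the consumer once the flush completes
}

func main() {
	flushChan := make(chan flushRequest)

	// Consumer, analogous to the batchStrategy.Start() goroutine.
	go func() {
		for req := range flushChan {
			fmt.Println("flushBuffer: batch sent")
			close(req.done) // the flush has actually completed
		}
	}()

	// Caller, analogous to Pipeline.Flush().
	req := flushRequest{done: make(chan struct{})}
	flushChan <- req
	<-req.done // guaranteed: batched messages are flushed before continuing
	fmt.Println("processor.Flush: no batched messages can arrive after this")
}

With this ordering, sendMessages() happens before processor.Flush() returns control to the caller, so nothing can be enqueued after the final flush.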
