Delayed Logging #3248

awelzel · 2023-08-24T18:24:26Z

awelzel
Aug 24, 2023
Maintainer

Intro

Provide a generic approach to delay the serialization of a log record such that Zeek scripts have an extended window of time for mutating in-flight log records and final policing. This document list challenges, considerations and three potential ideas.

General objective is to discuss and align use-cases with the ideas, but also inform about challenges and considerations identified and hopefully collect input/feedback for solutions.

Background

When a Log::write() is invoked for a given record, log stream and filter policy hooks are invoked synchronously to veto or modify the in-flight record, but otherwise the record is immediately serialized as "threading vals" and forwarded to the attached log filters without any further option to enrich or mutate the record from Zeek scripts. For very short-lived connections (or short operations) this hinders asynchronous enrichment of given log records or making "slow policy decisions" using, for example, remote lookups.

Existing Implementations and Approaches

SSL Scripts

delay_log / undelay_log: script level implementation using a when statement and hard-coded 15 seconds timeout. Unsure what or how this is used as there are no users.

Notice Framework

email_delay_tokens in support for policy/frameworks/notice/extend-email/hostnames.zeek to wait for lookup_hostname() to complete. However, this applies to email sending functionality and not directly related to the logging framework.

The notice framework also has a de-dup suppression mechanism to avoid reporting notices repeatedly in a given time window.

Replication

Generally, replicating data for enrichment across all workers to have zero-delay access is a workaround.

Fetch as Early as Possible

The alternative to replication is fetching enrichment data in, for example, new_connection() as early as possible in its lifetime and cross fingers the result comes before a given connection has been logged.

Considerations

There are various considerations and decisions around the flexibility of an approach. The following list summarizes those that came up so far:

Granularity of delay decision
- Global decision for all log records across all log streams
- Decision based on log streams
- Delay decision is based on individual records written to a stream or even filter
Granularity of delay duration
- A single global delay duration applied to all streams
- Stream specific delay durations
- Filter specific delay durations
- Record specific delay duration
Stickiness and modifications
- Record logged once delay duration expires
- Allow to finish (release/undelay) a delayed log before its delay duration expires
Delay State
- Is the delay decision attached to a record?
- Is it the Log::write() operation that is delayed (where the record is just part the context).
- Generally the concern is around the very same record being written multiple times to the same or different streams.
Location of delay decision
- Delay configuration only via redef or in zeek_init() handlers at startup
- Delaying Info records using existing events like connection_state_remove()
- Delay decisions limited to existing log policy hooks
- Introduce new logging hooks or callbacks just for delaying
Point of delay execution in logging pipeline
- Delay before or after the existing logging hooks?
- Implies requirement for pre/post delay hooks for additional decision making.

Implementation Ideas

Non-selective: All records of a stream are delayed
Selective: Delay decision is based on a per-record basis

None of the approaches allow an individual per-record delay - all records of a given stream have the same maximum delay duration. The selective approaches allow an explicit undelay/finish to expire the delay duration for a given record early, however.

Per-filter delay decisions are avoided with the assumption that any delay of a log record is useful/productive for all filters attached to a given stream.

Non-selective Stream Delay

For every Log::write() to a stream, serialization of the record is delayed after execution of global stream and filter policy hooks. The delay duration is constant and specific to a log stream. During the delay period, modifications to the log record become visible in the final log. Post delay callbacks for a stream can be registered for final modification and filtering of individual records after the delay period. If the delay queue size of a stream is exceeded, the oldest record is immediately expired to make room in the queue.

Log::set_max_delay_interval(id: ID, delay: interval);
Log::set_max_delay_queue_size(id: ID, queue_size: count);
Log::register_post_delay_cb(id: ID, rec: any, cb: PostDelayCallback)

Selective Delay-In-Policy-Hook

For every Log::write() to a stream, scripts can mark individual log records to be delayed using a new Log::delay() method during global stream and filter policy hooks execution. A non-delayed record is immediately serialized processing continues. Otherwise, the record is delayed by a maximum stream-specific delay duration. A delayed log record can be released before expiry of the delay period using an opaque token and a new Log::delay_finish() method. A simple reference counting scheme exists: Two scripts delaying the same record require two Log::delay_finish() calls. When marking a record for delay, registration of a post delay callback is possible for final modification and filtering of that specific record.

Calling Log::delay() outside of log policy hook execution is a violation. Enforcing the delay decision within hooks allows to frame it as a "policy decision" for a particular Log::write() and provides a defined point at which a delay starts.

Log::delay(id: ID, rec: any, post_delay_cb: PostDelayCallback &optional): DelayToken;
Log::delay_finish(id: ID, rec: any, token: DelayToken);  # Naming candidate Log::undelay()

Selective Delay-Anywhere-Anytime

This approach relaxes the restriction for calling Log::delay() only within log policy hooks. The API otherwise aligns with the previous one. The motivation being a more natural and convenient way to mark a log record for delay in the same way how scripts today extend records. An example would be using Log::delay() within connection_state_remove() for delaying a specific c$conn record.

This approach softens the relation to a concrete Log::write() operation of a given log record instance, introducing question around:

How does the Log::delay() caller know when the delay actually starts? Do they need to?
Promised delays: A Log::write() may not actually happen for something Log::delay() is called on.
Difficult to frame delay as a log policy decision
Composability and explainability for multiple Log::write() of the same record
There are many more places from which a delay can happen with this approach. Good: flexibility. Bad: flexibility, various different patterns possible.

Other Challenges, Details, Quirks

Record mutations and delaying. Reusing a log record instance combined with delaying may cause surprising behavior with existing scripts. The log output of the following sequence is currently well defined, but with log delaying the two output log lines will both have msg suffixed with " duplicate".
```
local info = Info(...);
Log::write(LOG, info);
info$msg += " duplicate";
Log::write(LOG, info);
```

We can likely accept that as a consequence of log delay functionality and general record mutability gotchas. Promotion of copy() patterns for such cases may be a pragmatic workaround. This situation can also happen through the usage of log events.

Execution of path_func and ext_func: Idea would be to run after the delay. May want to compute a different path after delay.
Semantics of multiple Log::write() with "promised delays"
Wanted semantics might either to delay both writes or to just one.
This is explicit for the in-policy-hook approach: There's context for the write.
Same record value is written to different streams: How does one know all streams that may potentially need to be delayed?
Example: zeek-long-connections
Can figure out record type and corresponding Log::ID during policy hook execution.
Potential record coercion during Log::write()
Records observed in a policy hook may not be "original": copied if coercion is required
Behavior possibly better explained with. Work-around is to not rely on the coercion feature and maybe should generally be deprecated.

Example: conn.log and lookup_addr()

The following script gives an example of the second approach "Selective Delay-In-Policy-Hook" to enrich the conn.log with names from an asynchronous lookup_addr() call.

@load base/protocols/conn

redef record Conn::Info += {
	orig_name: string &log &optional;
	resp_name: string &log &optional;
};

hook Log::log_stream_policy(rec: Conn::Info, id: Log::ID)
	{
	if ( id != Conn::LOG )
		return;

	local token = Log::delay(id, rec);

	when [id, rec, token] ( local orig_name = lookup_addr(rec$id$orig_h) )
		{
		rec$orig_name = orig_name;
		if ( rec?$resp_name )
			Log::delay_finish(id, rec, token);
		}
	timeout 190msec
		{
		Reporter::warning(fmt("lookup_addr timeout for %s", rec$id$orig_h));
		}

	when [id, rec, token] ( local resp_name = lookup_addr(rec$id$resp_h) )
		{
		rec$resp_name = resp_name;
		if ( rec?$orig_name )
			Log::delay_finish(id, rec, token);
		}
	timeout 190msec
		{
		Reporter::warning(fmt("lookup_addr timeout for %s", rec$id$resp_h));
		}
	}

event zeek_init()
	{
	Log::set_max_delay_interval(Conn::LOG, 200msec);
	}

awelzel · 2024-02-27T16:03:25Z

awelzel
Feb 27, 2024
Maintainer Author

An initial implementation made it into Zeek 6.2.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Delayed Logging #3248

{{title}}

Replies: 1 comment

{{title}}

Select a reply

Delayed Logging #3248

awelzel Aug 24, 2023 Maintainer

Intro

Background

Existing Implementations and Approaches

SSL Scripts

Notice Framework

Replication

Fetch as Early as Possible

Considerations

Implementation Ideas

Non-selective Stream Delay

Selective Delay-In-Policy-Hook

Selective Delay-Anywhere-Anytime

Other Challenges, Details, Quirks

Example: conn.log and lookup_addr()

Replies: 1 comment

awelzel Feb 27, 2024 Maintainer Author

awelzel
Aug 24, 2023
Maintainer

awelzel
Feb 27, 2024
Maintainer Author