Delayed Logging
#3248
Replies: 1 comment
-
An initial implementation made it into Zeek 6.2. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Intro
Provide a generic approach to delay the serialization of a log record such that Zeek scripts have an extended window of time for mutating in-flight log records and final policing. This document list challenges, considerations and three potential ideas.
General objective is to discuss and align use-cases with the ideas, but also inform about challenges and considerations identified and hopefully collect input/feedback for solutions.
Background
When a
Log::write()
is invoked for a given record, log stream and filter policy hooks are invoked synchronously to veto or modify the in-flight record, but otherwise the record is immediately serialized as "threading vals" and forwarded to the attached log filters without any further option to enrich or mutate the record from Zeek scripts. For very short-lived connections (or short operations) this hinders asynchronous enrichment of given log records or making "slow policy decisions" using, for example, remote lookups.Existing Implementations and Approaches
SSL Scripts
delay_log / undelay_log: script level implementation using a when statement and hard-coded 15 seconds timeout. Unsure what or how this is used as there are no users.
Notice Framework
email_delay_tokens in support for policy/frameworks/notice/extend-email/hostnames.zeek to wait for lookup_hostname() to complete. However, this applies to email sending functionality and not directly related to the logging framework.
The notice framework also has a de-dup suppression mechanism to avoid reporting notices repeatedly in a given time window.
Replication
Generally, replicating data for enrichment across all workers to have zero-delay access is a workaround.
Fetch as Early as Possible
The alternative to replication is fetching enrichment data in, for example,
new_connection()
as early as possible in its lifetime and cross fingers the result comes before a given connection has been logged.Considerations
There are various considerations and decisions around the flexibility of an approach. The following list summarizes those that came up so far:
Log::write()
operation that is delayed (where the record is just part the context).Implementation Ideas
None of the approaches allow an individual per-record delay - all records of a given stream have the same maximum delay duration. The selective approaches allow an explicit undelay/finish to expire the delay duration for a given record early, however.
Per-filter delay decisions are avoided with the assumption that any delay of a log record is useful/productive for all filters attached to a given stream.
Non-selective Stream Delay
For
every Log::write()
to a stream, serialization of the record is delayed after execution of global stream and filter policy hooks. The delay duration is constant and specific to a log stream. During the delay period, modifications to the log record become visible in the final log. Post delay callbacks for a stream can be registered for final modification and filtering of individual records after the delay period. If the delay queue size of a stream is exceeded, the oldest record is immediately expired to make room in the queue.Selective Delay-In-Policy-Hook
For every
Log::write()
to a stream, scripts can mark individual log records to be delayed using a newLog::delay()
method during global stream and filter policy hooks execution. A non-delayed record is immediately serialized processing continues. Otherwise, the record is delayed by a maximum stream-specific delay duration. A delayed log record can be released before expiry of the delay period using an opaque token and a newLog::delay_finish()
method. A simple reference counting scheme exists: Two scripts delaying the same record require twoLog::delay_finish()
calls. When marking a record for delay, registration of a post delay callback is possible for final modification and filtering of that specific record.Calling
Log::delay()
outside of log policy hook execution is a violation. Enforcing the delay decision within hooks allows to frame it as a "policy decision" for a particularLog::write()
and provides a defined point at which a delay starts.Selective Delay-Anywhere-Anytime
This approach relaxes the restriction for calling
Log::delay()
only within log policy hooks. The API otherwise aligns with the previous one. The motivation being a more natural and convenient way to mark a log record for delay in the same way how scripts today extend records. An example would be usingLog::delay()
withinconnection_state_remove()
for delaying a specificc$conn
record.This approach softens the relation to a concrete
Log::write()
operation of a given log record instance, introducing question around:Log::delay()
caller know when the delay actually starts? Do they need to?Log::write()
may not actually happen for somethingLog::delay()
is called on.Other Challenges, Details, Quirks
msg
suffixed with " duplicate".We can likely accept that as a consequence of log delay functionality and general record mutability gotchas. Promotion of
copy()
patterns for such cases may be a pragmatic workaround. This situation can also happen through the usage of log events.Execution of path_func and ext_func: Idea would be to run after the delay. May want to compute a different path after delay.
Semantics of multiple
Log::write()
with "promised delays"Wanted semantics might either to delay both writes or to just one.
This is explicit for the in-policy-hook approach: There's context for the write.
Same record value is written to different streams: How does one know all streams that may potentially need to be delayed?
Example: zeek-long-connections
Can figure out
record type
and correspondingLog::ID
during policy hook execution.Potential record coercion during
Log::write()
Records observed in a policy hook may not be "original": copied if coercion is required
Behavior possibly better explained with. Work-around is to not rely on the coercion feature and maybe should generally be deprecated.
Example: conn.log and lookup_addr()
The following script gives an example of the second approach "Selective Delay-In-Policy-Hook" to enrich the
conn.log
with names from an asynchronouslookup_addr()
call.Beta Was this translation helpful? Give feedback.
All reactions