AI Disclaimer: I did use AI (claude) to draft this document. Feel free to skim! Most of the ideas/concepts here are my own or belong to those tagged. I have reviewed this substantially.
Related issues/PRs:
Background
Snakemake's logging has been through a few iterations. The original system was a custom Logger singleton with a self.log_handler callback list and a monolithic text_handler method. Every log message was a dict with a level string — "job_info", "job_error", "shellcmd", "progress", etc. — and text_handler was a giant if/elif chain dispatching on those strings:
level = msg["level"]
if level == "job_info" and not self.is_quiet_about("rules"):
...
elif level == "job_error":
...
elif level == "shellcmd":
if self.printshellcmds:
...
elif level == "job_finished" and not self.is_quiet_about("progress"):
...
When I refactored logging to use standard Python logging and added the logger plugin interface, I created the LogEvent enum by directly mapping these old level strings to enum values. "job_info" became LogEvent.JOB_INFO. "job_error" became LogEvent.JOB_ERROR. "shellcmd" became LogEvent.SHELLCMD. This was the pragmatic choice at the time — it preserved existing behavior and let me get the plugin interface shipped without rethinking everything at once. But in retrospect I should have taken the time to rethink the event boundaries — JOB_INFO and JOB_STARTED could have been unified, groups didn't need their own event types, and the emission pattern (building bespoke extra dicts with None as the message) was carried over from the old system when it didn't need to be.
PR #32 adds typed dataclasses for parsing LogRecord payloads, and @jlumpe has contributed significantly to the design (the LogEventData base class, from_record()/from_extra() defaults, alias handling) and raised the plugin base refactor discussion in #44. But the dataclasses are still organized around those same legacy event types.
This RFC proposes redesigning the event model now that we have a clearer picture of what plugins and monitoring tools actually need. Most of the issues below trace directly back to that 1:1 mapping I did from the old text_handler levels. Before moving forward — I want to gather thoughts and feedback.
Event model
The problem
The current event types mirror the old text_handler dispatch, not the domain. A job's lifecycle is split across JOB_INFO, JOB_STARTED, JOB_FINISHED, and JOB_ERROR — four types with different field shapes that plugins have to correlate. Groups add two more. Emission is scattered: Job.log_info() emits JOB_INFO, then a separate SHELLCMD for the command, then separate logger.info() calls for checkpoint notes. One conceptual moment ("this job is about to run") produces 3-4 log records.
Call sites build bespoke dicts (extra=dict(event=LogEvent.JOB_INFO, jobid=..., ...)) with inconsistent field names (jobid vs job_id, groupid vs group_id). Structured events get emitted with None as the message — the real content is in extra — which is why ColorizingTextHandler.emit() has a special guard against it. I carried this pattern over from the old system — it worked, but it means there's no standard way for a handler to get a useful string representation of an event without reimplementing the formatter's logic.
Proposed: 8 event types replacing 14
Replace 14 LogEvent values with 8, organized around domain concepts:
| Current |
Proposed |
What changes |
WORKFLOW_STARTED, RUN_INFO, RESOURCES_INFO, RULEGRAPH |
WORKFLOW |
One type, status field. Stats, resources, rulegraph are fields on the started event. needs_rulegraph goes away. |
JOB_INFO, JOB_STARTED, JOB_FINISHED, JOB_ERROR |
JOB |
One type, status field. JOB_STARTED batch notification becomes a plain log message. |
GROUP_INFO, GROUP_ERROR |
(into JOB) |
Jobs carry an optional group_id. No separate group events. |
SHELLCMD from log_info |
(into JOB) |
Shell command is a field on JobDescription. |
SHELLCMD from shell.__new__ |
SHELLCMD |
Stays — actual runtime command execution. |
DEBUG_DAG |
DAG |
DAG resolution at debug level + lifecycle at info level. |
PROGRESS |
PROGRESS |
Unchanged. |
ERROR |
ERROR |
Non-job exceptions. |
| (new) |
DEPLOYMENT |
Conda create, container pull, env module activate. ~20 currently unstructured calls. |
| (new) |
STORAGE |
File retrieve/upload/cleanup. ~10 currently unstructured calls. |
DEPLOYMENT and STORAGE address #49 — structured info about environments, storage operations, etc. that are currently just logger.info() strings no plugin can parse.
Event structure
All events inherit from LogEventData, which carries a workflow_id field. This gets stamped automatically by EventPromotingFilter so call sites don't need to pass it — every event is associated with the workflow run that produced it.
Complex events use nested dataclasses. Simpler events (DagEvent, ShellCmdEvent, ProgressEvent) are flat.
JobEvent has identity fields (job_id, rule_name, group_id, status) plus two optional nested objects:
JobDescription — static info: inputs, outputs, wildcards, resources, shellcmd, benchmark data, etc. Populated on info, finished, and error so each event is self-contained.
JobResult — execution result from the executor: exit code, failure mode, external job ID, execution environment.
WorkflowEvent has identity + status plus:
WorkflowDescription — job count stats, resources, rulegraph, host info. Populated on started.
DeploymentEvent is designed around the new snakemake-interface-software-deployment-plugins. SDM plugins have clear lifecycle phases (pin → cache → deploy → activate → remove) and a common_settings.provides value identifying the provider. DeploymentEvent uses these directly:
@dataclass
class DeploymentEvent(LogEventData):
provider: str = "" # from common_settings.provides: "conda", "container", etc.
action: str = "" # "pin", "cache", "deploy", "activate", "remove", "check"
spec: Optional[str] = None
detail: Optional[str] = None
Forward-compatible with any future SDM plugin — a monitoring tool sees provider="conda", action="deploy" and can render it however it wants.
ErrorEvent is for non-job exceptions logged through print_exception(). Snakemake's exception hierarchy already carries workflow context (RuleException has .rule, .lineno, .filename, etc.) but print_exception() flattens it all into a formatted string. ErrorEvent extracts the fields plugins actually need:
@dataclass
class ErrorEvent(LogEventData):
message: str = ""
exception_type: str = ""
rule_name: Optional[str] = None
snakefile: Optional[str] = None
lineno: Optional[int] = None
@classmethod
def from_exception(cls, ex, linemaps=None): ...
from_exception() lives in snakemake core (it knows about RuleException vs WorkflowError). The dataclass lives in the interface package. message carries the full formatted output from format_exception_to_string(). I'm intentionally keeping this minimal — things like affected files or nested exception trees are in the message and can be promoted to typed fields later if plugin authors need them.
Event emission
With the redesign, every event dataclass gets a __str__ returning a useful one-liner, and call sites just use standard logger methods:
logger.info(job.to_event(JobStatus.INFO))
logger.info(DagEvent(action="building"))
logger.error(ErrorEvent.from_exception(ex, linemaps))
logger.debug(DagEvent(action="resolving", rule="align"))
An EventPromotingFilter on the logger detects when record.msg is a LogEventData instance, copies it to record.event_data, and stamps the workflow_id:
class EventPromotingFilter(logging.Filter):
def __init__(self, workflow_id=None):
self.workflow_id = workflow_id
def filter(self, record):
if isinstance(record.msg, LogEventData):
record.event_data = record.msg
if self.workflow_id and not record.msg.workflow_id:
record.msg.workflow_id = self.workflow_id
return True
Added once during LoggerManager setup. Handlers and handler-level filters see record.event_data already populated. record.getMessage() returns the __str__ output. No special methods on the dataclass — just standard logging.
Job.log_info(), Job.log_error(), and Job.get_log_error_info() (~75 lines) collapse into Job.to_event(status, **overrides). GroupJob.to_events() delegates to each child's to_event() with group_id set. The Job builds the data, the caller decides when and at what level to log it.
Structured executor data
See snakemake/snakemake-interface-common/issues/95 for more details.
In brief: JobResult is a new dataclass in snakemake-interface-common that executors populate with structured execution results — exit code, failure mode, external job ID, execution environment. BenchmarkData is a companion type that BenchmarkRecord exports to, giving plugins typed access to resource usage through the logger. Both flow through JobEvent: event.result for executor data, event.description.benchmark for profiling data.
The key motivation beyond logging: structured failure data unblocks failure-aware retries (#2441, #2123, slurm#65), and persisting external job IDs addresses #226.
Verbosity
The problem
The current Quietness system is something I inherited from the old text_handler's is_quiet_about() method and mapped more or less directly. It doesn't map to what users expect — --quiet rules suppresses errors, --quiet progress suppresses the job stats summary. (#3856, #2882, #2973)
Proposed: numeric scale
Status: needs further discussion. I'm not fully settled on the details yet. The general direction:
| Level |
Flag |
Shows |
| 0 |
--silent |
Errors only |
| 1 |
--quiet / -q |
+ progress, workflow summary, job finish one-liners |
| 2 |
(default) |
+ full job info, deployment, storage, DAG status |
| 3 |
--verbose / -v |
+ all plain info messages, shell commands |
| 4 |
--debug |
+ debug-level DAG resolution, internal state |
VerbosityFilter would replace DefaultFilter. Old --quiet [rules|progress|all] deprecated with a warning. The details need more thought — particularly around how this interacts with subprocess/remote execution and what users actually expect from --quiet.
Compact formatting is out of scope for this effort. I'm working on a rich logging plugin in parallel — that's where formatting concerns belong, in the plugin, not baked into core.
Plugin base refactor (#44)
The problem
The current LogHandlerBase is itself a logging.Handler subclass, so plugin authors can't inherit from StreamHandler or FileHandler without constructor conflicts. @jlumpe raised this in #44 and #37.
Proposed: factory pattern
Replace LogHandlerBase with a base class that creates a handler from a factory method rather than being one itself. Plugins that just need a custom formatter on a FileHandler can do exactly that. Backwards compatibility via a shim wrapping old-style plugins. This pairs naturally with the event model changes — both are breaking, so we do them together in v3.
Related issues/PRs:
Background
Snakemake's logging has been through a few iterations. The original system was a custom
Loggersingleton with aself.log_handlercallback list and a monolithictext_handlermethod. Every log message was a dict with alevelstring —"job_info","job_error","shellcmd","progress", etc. — andtext_handlerwas a giant if/elif chain dispatching on those strings:When I refactored logging to use standard Python
loggingand added the logger plugin interface, I created theLogEventenum by directly mapping these old level strings to enum values."job_info"becameLogEvent.JOB_INFO."job_error"becameLogEvent.JOB_ERROR."shellcmd"becameLogEvent.SHELLCMD. This was the pragmatic choice at the time — it preserved existing behavior and let me get the plugin interface shipped without rethinking everything at once. But in retrospect I should have taken the time to rethink the event boundaries —JOB_INFOandJOB_STARTEDcould have been unified, groups didn't need their own event types, and the emission pattern (building bespokeextradicts withNoneas the message) was carried over from the old system when it didn't need to be.PR #32 adds typed dataclasses for parsing
LogRecordpayloads, and @jlumpe has contributed significantly to the design (theLogEventDatabase class,from_record()/from_extra()defaults, alias handling) and raised the plugin base refactor discussion in #44. But the dataclasses are still organized around those same legacy event types.This RFC proposes redesigning the event model now that we have a clearer picture of what plugins and monitoring tools actually need. Most of the issues below trace directly back to that 1:1 mapping I did from the old
text_handlerlevels. Before moving forward — I want to gather thoughts and feedback.Event model
The problem
The current event types mirror the old
text_handlerdispatch, not the domain. A job's lifecycle is split acrossJOB_INFO,JOB_STARTED,JOB_FINISHED, andJOB_ERROR— four types with different field shapes that plugins have to correlate. Groups add two more. Emission is scattered:Job.log_info()emitsJOB_INFO, then a separateSHELLCMDfor the command, then separatelogger.info()calls for checkpoint notes. One conceptual moment ("this job is about to run") produces 3-4 log records.Call sites build bespoke dicts (
extra=dict(event=LogEvent.JOB_INFO, jobid=..., ...)) with inconsistent field names (jobidvsjob_id,groupidvsgroup_id). Structured events get emitted withNoneas the message — the real content is inextra— which is whyColorizingTextHandler.emit()has a special guard against it. I carried this pattern over from the old system — it worked, but it means there's no standard way for a handler to get a useful string representation of an event without reimplementing the formatter's logic.Proposed: 8 event types replacing 14
Replace 14
LogEventvalues with 8, organized around domain concepts:WORKFLOW_STARTED,RUN_INFO,RESOURCES_INFO,RULEGRAPHWORKFLOWneeds_rulegraphgoes away.JOB_INFO,JOB_STARTED,JOB_FINISHED,JOB_ERRORJOBJOB_STARTEDbatch notification becomes a plain log message.GROUP_INFO,GROUP_ERRORJOB)group_id. No separate group events.SHELLCMDfromlog_infoJOB)JobDescription.SHELLCMDfromshell.__new__SHELLCMDDEBUG_DAGDAGPROGRESSPROGRESSERRORERRORDEPLOYMENTSTORAGEDEPLOYMENTandSTORAGEaddress #49 — structured info about environments, storage operations, etc. that are currently justlogger.info()strings no plugin can parse.Event structure
All events inherit from
LogEventData, which carries aworkflow_idfield. This gets stamped automatically byEventPromotingFilterso call sites don't need to pass it — every event is associated with the workflow run that produced it.Complex events use nested dataclasses. Simpler events (
DagEvent,ShellCmdEvent,ProgressEvent) are flat.JobEventhas identity fields (job_id, rule_name, group_id, status) plus two optional nested objects:JobDescription— static info: inputs, outputs, wildcards, resources, shellcmd, benchmark data, etc. Populated on info, finished, and error so each event is self-contained.JobResult— execution result from the executor: exit code, failure mode, external job ID, execution environment.WorkflowEventhas identity + status plus:WorkflowDescription— job count stats, resources, rulegraph, host info. Populated on started.DeploymentEventis designed around the newsnakemake-interface-software-deployment-plugins. SDM plugins have clear lifecycle phases (pin → cache → deploy → activate → remove) and acommon_settings.providesvalue identifying the provider.DeploymentEventuses these directly:Forward-compatible with any future SDM plugin — a monitoring tool sees
provider="conda", action="deploy"and can render it however it wants.ErrorEventis for non-job exceptions logged throughprint_exception(). Snakemake's exception hierarchy already carries workflow context (RuleExceptionhas.rule,.lineno,.filename, etc.) butprint_exception()flattens it all into a formatted string.ErrorEventextracts the fields plugins actually need:from_exception()lives in snakemake core (it knows aboutRuleExceptionvsWorkflowError). The dataclass lives in the interface package.messagecarries the full formatted output fromformat_exception_to_string(). I'm intentionally keeping this minimal — things like affected files or nested exception trees are in the message and can be promoted to typed fields later if plugin authors need them.Event emission
With the redesign, every event dataclass gets a
__str__returning a useful one-liner, and call sites just use standard logger methods:An
EventPromotingFilteron the logger detects whenrecord.msgis aLogEventDatainstance, copies it torecord.event_data, and stamps theworkflow_id:Added once during
LoggerManagersetup. Handlers and handler-level filters seerecord.event_dataalready populated.record.getMessage()returns the__str__output. No special methods on the dataclass — just standard logging.Job.log_info(),Job.log_error(), andJob.get_log_error_info()(~75 lines) collapse intoJob.to_event(status, **overrides).GroupJob.to_events()delegates to each child'sto_event()withgroup_idset. The Job builds the data, the caller decides when and at what level to log it.Structured executor data
In brief:
JobResultis a new dataclass insnakemake-interface-commonthat executors populate with structured execution results — exit code, failure mode, external job ID, execution environment.BenchmarkDatais a companion type thatBenchmarkRecordexports to, giving plugins typed access to resource usage through the logger. Both flow throughJobEvent:event.resultfor executor data,event.description.benchmarkfor profiling data.The key motivation beyond logging: structured failure data unblocks failure-aware retries (#2441, #2123, slurm#65), and persisting external job IDs addresses #226.
Verbosity
The problem
The current
Quietnesssystem is something I inherited from the oldtext_handler'sis_quiet_about()method and mapped more or less directly. It doesn't map to what users expect —--quiet rulessuppresses errors,--quiet progresssuppresses the job stats summary. (#3856, #2882, #2973)Proposed: numeric scale
Status: needs further discussion. I'm not fully settled on the details yet. The general direction:
--silent--quiet/-q--verbose/-v--debugVerbosityFilterwould replaceDefaultFilter. Old--quiet [rules|progress|all]deprecated with a warning. The details need more thought — particularly around how this interacts with subprocess/remote execution and what users actually expect from--quiet.Compact formatting is out of scope for this effort. I'm working on a rich logging plugin in parallel — that's where formatting concerns belong, in the plugin, not baked into core.
Plugin base refactor (#44)
The problem
The current
LogHandlerBaseis itself alogging.Handlersubclass, so plugin authors can't inherit fromStreamHandlerorFileHandlerwithout constructor conflicts. @jlumpe raised this in #44 and #37.Proposed: factory pattern
Replace
LogHandlerBasewith a base class that creates a handler from a factory method rather than being one itself. Plugins that just need a custom formatter on aFileHandlercan do exactly that. Backwards compatibility via a shim wrapping old-style plugins. This pairs naturally with the event model changes — both are breaking, so we do them together in v3.