Skip to content

RFC: v3 #52

@cademirch

Description

@cademirch

AI Disclaimer: I did use AI (claude) to draft this document. Feel free to skim! Most of the ideas/concepts here are my own or belong to those tagged. I have reviewed this substantially.

Related issues/PRs:


Background

Snakemake's logging has been through a few iterations. The original system was a custom Logger singleton with a self.log_handler callback list and a monolithic text_handler method. Every log message was a dict with a level string — "job_info", "job_error", "shellcmd", "progress", etc. — and text_handler was a giant if/elif chain dispatching on those strings:

level = msg["level"]
if level == "job_info" and not self.is_quiet_about("rules"):
    ...
elif level == "job_error":
    ...
elif level == "shellcmd":
    if self.printshellcmds:
        ...
elif level == "job_finished" and not self.is_quiet_about("progress"):
    ...

When I refactored logging to use standard Python logging and added the logger plugin interface, I created the LogEvent enum by directly mapping these old level strings to enum values. "job_info" became LogEvent.JOB_INFO. "job_error" became LogEvent.JOB_ERROR. "shellcmd" became LogEvent.SHELLCMD. This was the pragmatic choice at the time — it preserved existing behavior and let me get the plugin interface shipped without rethinking everything at once. But in retrospect I should have taken the time to rethink the event boundaries — JOB_INFO and JOB_STARTED could have been unified, groups didn't need their own event types, and the emission pattern (building bespoke extra dicts with None as the message) was carried over from the old system when it didn't need to be.

PR #32 adds typed dataclasses for parsing LogRecord payloads, and @jlumpe has contributed significantly to the design (the LogEventData base class, from_record()/from_extra() defaults, alias handling) and raised the plugin base refactor discussion in #44. But the dataclasses are still organized around those same legacy event types.

This RFC proposes redesigning the event model now that we have a clearer picture of what plugins and monitoring tools actually need. Most of the issues below trace directly back to that 1:1 mapping I did from the old text_handler levels. Before moving forward — I want to gather thoughts and feedback.


Event model

The problem

The current event types mirror the old text_handler dispatch, not the domain. A job's lifecycle is split across JOB_INFO, JOB_STARTED, JOB_FINISHED, and JOB_ERROR — four types with different field shapes that plugins have to correlate. Groups add two more. Emission is scattered: Job.log_info() emits JOB_INFO, then a separate SHELLCMD for the command, then separate logger.info() calls for checkpoint notes. One conceptual moment ("this job is about to run") produces 3-4 log records.

Call sites build bespoke dicts (extra=dict(event=LogEvent.JOB_INFO, jobid=..., ...)) with inconsistent field names (jobid vs job_id, groupid vs group_id). Structured events get emitted with None as the message — the real content is in extra — which is why ColorizingTextHandler.emit() has a special guard against it. I carried this pattern over from the old system — it worked, but it means there's no standard way for a handler to get a useful string representation of an event without reimplementing the formatter's logic.

Proposed: 8 event types replacing 14

Replace 14 LogEvent values with 8, organized around domain concepts:

Current Proposed What changes
WORKFLOW_STARTED, RUN_INFO, RESOURCES_INFO, RULEGRAPH WORKFLOW One type, status field. Stats, resources, rulegraph are fields on the started event. needs_rulegraph goes away.
JOB_INFO, JOB_STARTED, JOB_FINISHED, JOB_ERROR JOB One type, status field. JOB_STARTED batch notification becomes a plain log message.
GROUP_INFO, GROUP_ERROR (into JOB) Jobs carry an optional group_id. No separate group events.
SHELLCMD from log_info (into JOB) Shell command is a field on JobDescription.
SHELLCMD from shell.__new__ SHELLCMD Stays — actual runtime command execution.
DEBUG_DAG DAG DAG resolution at debug level + lifecycle at info level.
PROGRESS PROGRESS Unchanged.
ERROR ERROR Non-job exceptions.
(new) DEPLOYMENT Conda create, container pull, env module activate. ~20 currently unstructured calls.
(new) STORAGE File retrieve/upload/cleanup. ~10 currently unstructured calls.

DEPLOYMENT and STORAGE address #49 — structured info about environments, storage operations, etc. that are currently just logger.info() strings no plugin can parse.

Event structure

All events inherit from LogEventData, which carries a workflow_id field. This gets stamped automatically by EventPromotingFilter so call sites don't need to pass it — every event is associated with the workflow run that produced it.

Complex events use nested dataclasses. Simpler events (DagEvent, ShellCmdEvent, ProgressEvent) are flat.

JobEvent has identity fields (job_id, rule_name, group_id, status) plus two optional nested objects:

  • JobDescription — static info: inputs, outputs, wildcards, resources, shellcmd, benchmark data, etc. Populated on info, finished, and error so each event is self-contained.
  • JobResult — execution result from the executor: exit code, failure mode, external job ID, execution environment.

WorkflowEvent has identity + status plus:

  • WorkflowDescription — job count stats, resources, rulegraph, host info. Populated on started.

DeploymentEvent is designed around the new snakemake-interface-software-deployment-plugins. SDM plugins have clear lifecycle phases (pin → cache → deploy → activate → remove) and a common_settings.provides value identifying the provider. DeploymentEvent uses these directly:

@dataclass
class DeploymentEvent(LogEventData):
    provider: str = ""      # from common_settings.provides: "conda", "container", etc.
    action: str = ""        # "pin", "cache", "deploy", "activate", "remove", "check"
    spec: Optional[str] = None
    detail: Optional[str] = None

Forward-compatible with any future SDM plugin — a monitoring tool sees provider="conda", action="deploy" and can render it however it wants.

ErrorEvent is for non-job exceptions logged through print_exception(). Snakemake's exception hierarchy already carries workflow context (RuleException has .rule, .lineno, .filename, etc.) but print_exception() flattens it all into a formatted string. ErrorEvent extracts the fields plugins actually need:

@dataclass
class ErrorEvent(LogEventData):
    message: str = ""
    exception_type: str = ""
    rule_name: Optional[str] = None
    snakefile: Optional[str] = None
    lineno: Optional[int] = None
 
    @classmethod
    def from_exception(cls, ex, linemaps=None): ...

from_exception() lives in snakemake core (it knows about RuleException vs WorkflowError). The dataclass lives in the interface package. message carries the full formatted output from format_exception_to_string(). I'm intentionally keeping this minimal — things like affected files or nested exception trees are in the message and can be promoted to typed fields later if plugin authors need them.

Event emission

With the redesign, every event dataclass gets a __str__ returning a useful one-liner, and call sites just use standard logger methods:

logger.info(job.to_event(JobStatus.INFO))
logger.info(DagEvent(action="building"))
logger.error(ErrorEvent.from_exception(ex, linemaps))
logger.debug(DagEvent(action="resolving", rule="align"))

An EventPromotingFilter on the logger detects when record.msg is a LogEventData instance, copies it to record.event_data, and stamps the workflow_id:

class EventPromotingFilter(logging.Filter):
    def __init__(self, workflow_id=None):
        self.workflow_id = workflow_id
 
    def filter(self, record):
        if isinstance(record.msg, LogEventData):
            record.event_data = record.msg
            if self.workflow_id and not record.msg.workflow_id:
                record.msg.workflow_id = self.workflow_id
        return True

Added once during LoggerManager setup. Handlers and handler-level filters see record.event_data already populated. record.getMessage() returns the __str__ output. No special methods on the dataclass — just standard logging.

Job.log_info(), Job.log_error(), and Job.get_log_error_info() (~75 lines) collapse into Job.to_event(status, **overrides). GroupJob.to_events() delegates to each child's to_event() with group_id set. The Job builds the data, the caller decides when and at what level to log it.


Structured executor data

See snakemake/snakemake-interface-common/issues/95 for more details.

In brief: JobResult is a new dataclass in snakemake-interface-common that executors populate with structured execution results — exit code, failure mode, external job ID, execution environment. BenchmarkData is a companion type that BenchmarkRecord exports to, giving plugins typed access to resource usage through the logger. Both flow through JobEvent: event.result for executor data, event.description.benchmark for profiling data.

The key motivation beyond logging: structured failure data unblocks failure-aware retries (#2441, #2123, slurm#65), and persisting external job IDs addresses #226.


Verbosity

The problem

The current Quietness system is something I inherited from the old text_handler's is_quiet_about() method and mapped more or less directly. It doesn't map to what users expect — --quiet rules suppresses errors, --quiet progress suppresses the job stats summary. (#3856, #2882, #2973)

Proposed: numeric scale

Status: needs further discussion. I'm not fully settled on the details yet. The general direction:

Level Flag Shows
0 --silent Errors only
1 --quiet / -q + progress, workflow summary, job finish one-liners
2 (default) + full job info, deployment, storage, DAG status
3 --verbose / -v + all plain info messages, shell commands
4 --debug + debug-level DAG resolution, internal state

VerbosityFilter would replace DefaultFilter. Old --quiet [rules|progress|all] deprecated with a warning. The details need more thought — particularly around how this interacts with subprocess/remote execution and what users actually expect from --quiet.

Compact formatting is out of scope for this effort. I'm working on a rich logging plugin in parallel — that's where formatting concerns belong, in the plugin, not baked into core.


Plugin base refactor (#44)

The problem

The current LogHandlerBase is itself a logging.Handler subclass, so plugin authors can't inherit from StreamHandler or FileHandler without constructor conflicts. @jlumpe raised this in #44 and #37.

Proposed: factory pattern

Replace LogHandlerBase with a base class that creates a handler from a factory method rather than being one itself. Plugins that just need a custom formatter on a FileHandler can do exactly that. Backwards compatibility via a shim wrapping old-style plugins. This pairs naturally with the event model changes — both are breaking, so we do them together in v3.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions