RFC: v3

>AI Disclaimer: I did use AI (claude) to draft this document. Feel free to skim! Most of the ideas/concepts here are my own or belong to those tagged. I have reviewed this substantially.

**Related issues/PRs:**
- [logger-plugins#32](https://github.com/snakemake/snakemake-interface-logger-plugins/pull/32) — Event dataclasses
- [logger-plugins#49](https://github.com/snakemake/snakemake-interface-logger-plugins/issues/49) — More log events / more logging information
- [logger-plugins#44](https://github.com/snakemake/snakemake-interface-logger-plugins/issues/44) — Refactor plugin base
- [snakemake#3856](https://github.com/snakemake/snakemake/issues/3856) — Reasonable "quiet" option
- [snakemake#2882](https://github.com/snakemake/snakemake/issues/2882) — --quiet settings not intuitive
- [snakemake#2973](https://github.com/snakemake/snakemake/issues/2973) — --quiet no longer prints summary
- [snakemake#2441](https://github.com/snakemake/snakemake/issues/2441) — More control on retry behaviour
- [snakemake#2123](https://github.com/snakemake/snakemake/issues/2123) — Cluster retry get failure reason
- [snakemake#226](https://github.com/snakemake/snakemake/issues/226) — Keep track of external job IDs after completion
- [slurm#65](https://github.com/snakemake/snakemake-executor-plugin-slurm/issues/65) — Dynamic resources depending on retry reason
 
---
 
## Background
 
Snakemake's logging has been through a few iterations. The original system was a custom `Logger` singleton with a `self.log_handler` callback list and a monolithic `text_handler` method. Every log message was a dict with a `level` string — `"job_info"`, `"job_error"`, `"shellcmd"`, `"progress"`, etc. — and `text_handler` was a giant if/elif chain dispatching on those strings:
 
```python
level = msg["level"]
if level == "job_info" and not self.is_quiet_about("rules"):
    ...
elif level == "job_error":
    ...
elif level == "shellcmd":
    if self.printshellcmds:
        ...
elif level == "job_finished" and not self.is_quiet_about("progress"):
    ...
```
 
When I refactored logging to use standard Python `logging` and added the logger plugin interface, I created the `LogEvent` enum by directly mapping these old level strings to enum values. `"job_info"` became `LogEvent.JOB_INFO`. `"job_error"` became `LogEvent.JOB_ERROR`. `"shellcmd"` became `LogEvent.SHELLCMD`. This was the pragmatic choice at the time — it preserved existing behavior and let me get the plugin interface shipped without rethinking everything at once. But in retrospect I should have taken the time to rethink the event boundaries — `JOB_INFO` and `JOB_STARTED` could have been unified, groups didn't need their own event types, and the emission pattern (building bespoke `extra` dicts with `None` as the message) was carried over from the old system when it didn't need to be.
 
PR #32 adds typed dataclasses for parsing `LogRecord` payloads, and @jlumpe has contributed significantly to the design (the `LogEventData` base class, `from_record()`/`from_extra()` defaults, alias handling) and raised the plugin base refactor discussion in #44. But the dataclasses are still organized around those same legacy event types.
 
This RFC proposes redesigning the event model now that we have a clearer picture of what plugins and monitoring tools actually need. Most of the issues below trace directly back to that 1:1 mapping I did from the old `text_handler` levels. Before moving forward — I want to gather thoughts and feedback.
 
---
 
## Event model
 
### The problem
 
The current event types mirror the old `text_handler` dispatch, not the domain. A job's lifecycle is split across `JOB_INFO`, `JOB_STARTED`, `JOB_FINISHED`, and `JOB_ERROR` — four types with different field shapes that plugins have to correlate. Groups add two more. Emission is scattered: `Job.log_info()` emits `JOB_INFO`, then a separate `SHELLCMD` for the command, then separate `logger.info()` calls for checkpoint notes. One conceptual moment ("this job is about to run") produces 3-4 log records.
 
Call sites build bespoke dicts (`extra=dict(event=LogEvent.JOB_INFO, jobid=..., ...)`) with inconsistent field names (`jobid` vs `job_id`, `groupid` vs `group_id`). Structured events get emitted with `None` as the message — the real content is in `extra` — which is why `ColorizingTextHandler.emit()` has a special guard against it. I carried this pattern over from the old system — it worked, but it means there's no standard way for a handler to get a useful string representation of an event without reimplementing the formatter's logic.
 
### Proposed: 8 event types replacing 14
 
Replace 14 `LogEvent` values with 8, organized around domain concepts:
 
| Current | Proposed | What changes |
|---|---|---|
| `WORKFLOW_STARTED`, `RUN_INFO`, `RESOURCES_INFO`, `RULEGRAPH` | `WORKFLOW` | One type, status field. Stats, resources, rulegraph are fields on the started event. `needs_rulegraph` goes away. |
| `JOB_INFO`, `JOB_STARTED`, `JOB_FINISHED`, `JOB_ERROR` | `JOB` | One type, status field. `JOB_STARTED` batch notification becomes a plain log message. |
| `GROUP_INFO`, `GROUP_ERROR` | *(into `JOB`)* | Jobs carry an optional `group_id`. No separate group events. |
| `SHELLCMD` from `log_info` | *(into `JOB`)* | Shell command is a field on `JobDescription`. |
| `SHELLCMD` from `shell.__new__` | `SHELLCMD` | Stays — actual runtime command execution. |
| `DEBUG_DAG` | `DAG` | DAG resolution at debug level + lifecycle at info level. |
| `PROGRESS` | `PROGRESS` | Unchanged. |
| `ERROR` | `ERROR` | Non-job exceptions. |
| *(new)* | `DEPLOYMENT` | Conda create, container pull, env module activate. ~20 currently unstructured calls. |
| *(new)* | `STORAGE` | File retrieve/upload/cleanup. ~10 currently unstructured calls. |
 
`DEPLOYMENT` and `STORAGE` address #49 — structured info about environments, storage operations, etc. that are currently just `logger.info()` strings no plugin can parse.
 
### Event structure
 
All events inherit from `LogEventData`, which carries a `workflow_id` field. This gets stamped automatically by `EventPromotingFilter` so call sites don't need to pass it — every event is associated with the workflow run that produced it.
 
Complex events use nested dataclasses. Simpler events (`DagEvent`, `ShellCmdEvent`, `ProgressEvent`) are flat.
 
**`JobEvent`** has identity fields (job_id, rule_name, group_id, status) plus two optional nested objects:
- **`JobDescription`** — static info: inputs, outputs, wildcards, resources, shellcmd, benchmark data, etc. Populated on info, finished, and error so each event is self-contained.
- **`JobResult`** — execution result from the executor: exit code, failure mode, external job ID, execution environment. 
 
**`WorkflowEvent`** has identity + status plus:
- **`WorkflowDescription`** — job count stats, resources, rulegraph, host info. Populated on started.
 
**`DeploymentEvent`** is designed around the new `snakemake-interface-software-deployment-plugins`. SDM plugins have clear lifecycle phases (pin → cache → deploy → activate → remove) and a `common_settings.provides` value identifying the provider. `DeploymentEvent` uses these directly:
 
```python
@dataclass
class DeploymentEvent(LogEventData):
    provider: str = ""      # from common_settings.provides: "conda", "container", etc.
    action: str = ""        # "pin", "cache", "deploy", "activate", "remove", "check"
    spec: Optional[str] = None
    detail: Optional[str] = None
```
 
Forward-compatible with any future SDM plugin — a monitoring tool sees `provider="conda", action="deploy"` and can render it however it wants.
 
**`ErrorEvent`** is for non-job exceptions logged through `print_exception()`. Snakemake's exception hierarchy already carries workflow context (`RuleException` has `.rule`, `.lineno`, `.filename`, etc.) but `print_exception()` flattens it all into a formatted string. `ErrorEvent` extracts the fields plugins actually need:
 
```python
@dataclass
class ErrorEvent(LogEventData):
    message: str = ""
    exception_type: str = ""
    rule_name: Optional[str] = None
    snakefile: Optional[str] = None
    lineno: Optional[int] = None
 
    @classmethod
    def from_exception(cls, ex, linemaps=None): ...
```
 
`from_exception()` lives in snakemake core (it knows about `RuleException` vs `WorkflowError`). The dataclass lives in the interface package. `message` carries the full formatted output from `format_exception_to_string()`. I'm intentionally keeping this minimal — things like affected files or nested exception trees are in the message and can be promoted to typed fields later if plugin authors need them.
 
### Event emission
 
With the redesign, every event dataclass gets a `__str__` returning a useful one-liner, and call sites just use standard logger methods:
 
```python
logger.info(job.to_event(JobStatus.INFO))
logger.info(DagEvent(action="building"))
logger.error(ErrorEvent.from_exception(ex, linemaps))
logger.debug(DagEvent(action="resolving", rule="align"))
```
 
An `EventPromotingFilter` on the logger detects when `record.msg` is a `LogEventData` instance, copies it to `record.event_data`, and stamps the `workflow_id`:
 
```python
class EventPromotingFilter(logging.Filter):
    def __init__(self, workflow_id=None):
        self.workflow_id = workflow_id
 
    def filter(self, record):
        if isinstance(record.msg, LogEventData):
            record.event_data = record.msg
            if self.workflow_id and not record.msg.workflow_id:
                record.msg.workflow_id = self.workflow_id
        return True
```
 
Added once during `LoggerManager` setup. Handlers and handler-level filters see `record.event_data` already populated. `record.getMessage()` returns the `__str__` output. No special methods on the dataclass — just standard logging.
 
`Job.log_info()`, `Job.log_error()`, and `Job.get_log_error_info()` (~75 lines) collapse into `Job.to_event(status, **overrides)`. `GroupJob.to_events()` delegates to each child's `to_event()` with `group_id` set. The Job builds the data, the caller decides when and at what level to log it.
 
---
 
## Structured executor data
 
> See snakemake/snakemake-interface-common/issues/95 for more details.
 
In brief: `JobResult` is a new dataclass in `snakemake-interface-common` that executors populate with structured execution results — exit code, failure mode, external job ID, execution environment. `BenchmarkData` is a companion type that `BenchmarkRecord` exports to, giving plugins typed access to resource usage through the logger. Both flow through `JobEvent`: `event.result` for executor data, `event.description.benchmark` for profiling data.
 
The key motivation beyond logging: structured failure data unblocks failure-aware retries (#2441, #2123, slurm#65), and persisting external job IDs addresses #226.
 
---
 
## Verbosity
 
### The problem
 
The current `Quietness` system is something I inherited from the old `text_handler`'s `is_quiet_about()` method and mapped more or less directly. It doesn't map to what users expect — `--quiet rules` suppresses errors, `--quiet progress` suppresses the job stats summary. (#3856, #2882, #2973)
 
### Proposed: numeric scale
 
**Status: needs further discussion.** I'm not fully settled on the details yet. The general direction:
 
| Level | Flag | Shows |
|---|---|---|
| 0 | `--silent` | Errors only |
| 1 | `--quiet` / `-q` | + progress, workflow summary, job finish one-liners |
| 2 | *(default)* | + full job info, deployment, storage, DAG status |
| 3 | `--verbose` / `-v` | + all plain info messages, shell commands |
| 4 | `--debug` | + debug-level DAG resolution, internal state |
 
`VerbosityFilter` would replace `DefaultFilter`. Old `--quiet [rules|progress|all]` deprecated with a warning. The details need more thought — particularly around how this interacts with subprocess/remote execution and what users actually expect from `--quiet`.
 
**Compact formatting** is out of scope for this effort. I'm working on a rich logging plugin in parallel — that's where formatting concerns belong, in the plugin, not baked into core.
 
---
 
## Plugin base refactor (#44)
 
### The problem
 
The current `LogHandlerBase` is itself a `logging.Handler` subclass, so plugin authors can't inherit from `StreamHandler` or `FileHandler` without constructor conflicts. @jlumpe raised this in #44 and #37.
 
### Proposed: factory pattern
 
Replace `LogHandlerBase` with a base class that creates a handler from a factory method rather than being one itself. Plugins that just need a custom formatter on a `FileHandler` can do exactly that. Backwards compatibility via a shim wrapping old-style plugins. This pairs naturally with the event model changes — both are breaking, so we do them together in v3.

Current	Proposed	What changes
`WORKFLOW_STARTED`, `RUN_INFO`, `RESOURCES_INFO`, `RULEGRAPH`	`WORKFLOW`	One type, status field. Stats, resources, rulegraph are fields on the started event. `needs_rulegraph` goes away.
`JOB_INFO`, `JOB_STARTED`, `JOB_FINISHED`, `JOB_ERROR`	`JOB`	One type, status field. `JOB_STARTED` batch notification becomes a plain log message.
`GROUP_INFO`, `GROUP_ERROR`	(into `JOB`)	Jobs carry an optional `group_id`. No separate group events.
`SHELLCMD` from `log_info`	(into `JOB`)	Shell command is a field on `JobDescription`.
`SHELLCMD` from `shell.__new__`	`SHELLCMD`	Stays — actual runtime command execution.
`DEBUG_DAG`	`DAG`	DAG resolution at debug level + lifecycle at info level.
`PROGRESS`	`PROGRESS`	Unchanged.
`ERROR`	`ERROR`	Non-job exceptions.
(new)	`DEPLOYMENT`	Conda create, container pull, env module activate. ~20 currently unstructured calls.
(new)	`STORAGE`	File retrieve/upload/cleanup. ~10 currently unstructured calls.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: v3 #52

Background

Event model

The problem

Proposed: 8 event types replacing 14

Event structure

Event emission

Structured executor data

Verbosity

The problem

Proposed: numeric scale

Plugin base refactor (#44)

The problem

Proposed: factory pattern

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Level	Flag	Shows
0	`--silent`	Errors only
1	`--quiet` / `-q`	+ progress, workflow summary, job finish one-liners
2	(default)	+ full job info, deployment, storage, DAG status
3	`--verbose` / `-v`	+ all plain info messages, shell commands
4	`--debug`	+ debug-level DAG resolution, internal state

RFC: v3 #52

Description

Background

Event model

The problem

Proposed: 8 event types replacing 14

Event structure

Event emission

Structured executor data

Verbosity

The problem

Proposed: numeric scale

Plugin base refactor (#44)

The problem

Proposed: factory pattern

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions