Skip to content

Better exception logging #53

@jlumpe

Description

@jlumpe

Currently it's difficult to process exceptions in logger plugins. There are mostly two main issues:

Exception details not provided in log event

The ERROR event seems to be intended to have more details about the exception (the planned ErrorEvent dataclass has multiple different attributes). Currently this is emitted from snakemake.exceptions.print_exception(). This formats the exception using format_exception_to_string() for the log message, but the only additional information attached is the name of the exception type. This makes it hard for plugins to do anything with the error other than display the preformatted message provided.

A quick and easy solution is to attach the exception instance to the log record for the ERROR event in the standard way, using the exc_info argument to the logging method. This makes everything accessible to loggers, including the source file and line number. The linemaps argument to print_exception() could also be included in the extra dict, so that loggers can set the correct line numbers in tracebacks.

Multiple error events

If an error occurs during a job, two separate events are emitted: ERROR and JOB_ERROR. Details about the exception itself can be attached to ERROR because the function that creates the event is passed the exception instance. However it is not possible to know which job the error corresponds to. JOB_ERROR is the opposite, it indicates which job had an error but does not have the exception details attached to it.

Solving this might be a lot more complicated. JOB_ERROR is emitted from Job.log_error(), but there are multiple different code paths leading to it. Here's what I found from my analysis:

graph TD

ERROR["`ERROR`"]
JOB_ERROR["`JOB_ERROR`"]

subgraph executor_funcs ["Executors"]
	executor_callback["`local.Executor._callback(job_info)`"]
	touch_executor_run_job["`touch.Executor.run_job()`"]
	%%executor_callback ~~~ touch_executor_run_job
end

abstractexecutor_report_job_error["`AbstractExecutor.report_job_error(job_info, msg)`"]
abstractexecutor_print_job_error["`AbstractExecutor.print_job_error(job_info, msg)`"]

print_exception["`exceptions.print_exception(ex)`"]
job_log_error["`Job.log_error(msg)`"]
groupjob_log_error["`GroupJob.log_error(msg)`"]
jobscheduler_schedule["`JobScheduler.schedule()`"]

executor_funcs -->|ex| print_exception
executor_funcs -->|job_info| abstractexecutor_report_job_error

print_exception --> ERROR

abstractexecutor_report_job_error -->|job_info, msg| abstractexecutor_print_job_error
abstractexecutor_print_job_error -->|msg| job_log_error
groupjob_log_error -->|"()"| job_log_error --> JOB_ERROR
jobscheduler_schedule -.->|"()"| job_log_error & groupjob_log_error
jobscheduler_schedule --->|"()"| print_exception
Loading

Based on this, it might be possible to add an optional job_id or job_ids argument to print_exception that could be attached to the event, which could be provided in local.Excutor._callback(). There would still be two events, but it would be possible to match them with each other.

This doesn't cover the paths from JobScheduler.schedule

Caveats

I don't know how any of this changes with remote executors.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions