Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support cancellation of timer operations #110

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

civodul
Copy link
Collaborator

@civodul civodul commented Oct 1, 2024

This patch series fixes #109 by supporting a "cancel" function for base operations and using it to remove timer wheel entries when a timer "loses" in a choice operation.

I'd like feedback in particular on the last commit: can someone confirm set! good enough, or should we use an atomic box instead?

I ran the test suite of the Shepherd and that of Cuirass against this branch: the former has good coverage though it uses a single POSIX thread, the latter has not-so-good coverage but uses multiple POSIX threads. Both passed.

Thoughts?

* fibers/operations.scm (<base-op>): Rename constructor to
‘%make-base-operation’.
[cancel-fn]: New field.
(make-base-operation, cancel-other-operations): New procedures.
(perform-operation)[block]: Define ‘resume’ to call
‘cancel-other-operations’ because calling the real ‘resume’.
* fibers.texi (Operations): Adjust accordingly.
* fibers/timer-wheel.scm (timer-wheel-remove!): New procedure.
Fixes wingo#109.

Previously, an operation like:

  (choice-operation (sleep-operation 1234) (get-operation channel))

would accumulate timer wheel entries every time the ‘get’ operation wins
over the ‘sleep’ operation, potentially leading to unbounded memory
usage (each ‘sleep’ timer and its associated continuation would remain
on the wheel for 1234 seconds in this case).

This commit fixes it by removing the timer wheel entry as soon as the
timer operation is canceled.

* fibers/timers.scm (timer-operation)[wheel-entry]: New variable.
Set it in block function.  Add cancel function.
* fibers/scheduler.scm (scheduler-timers): Export.
* tests/cancel-timer.scm: New file.
* Makefile.am (TESTS): Add it.
@civodul civodul requested a review from wingo October 1, 2024 20:41
@civodul civodul self-assigned this Oct 1, 2024
@wingo
Copy link
Owner

wingo commented Oct 10, 2024

Looking good, I do have a question though and it's a long one.

Basically, I want to ensure that Concurrent ML's withNack combinator is implementable in a composable way.

  1. If withNack is already implementable with what we have, then what does this patch offer in addition?
  2. If withNack is not currently possible, will it be possible with this work?

To explain withNack, first I should mention CML's guard combinator, which is essentially an operation generator: when you go to perform-operation, that operation may have a guard function, which if it is present will be called to return a (probably) fresh operation ("event", in CML language). guard lets CML's sync (our perform-operation) spawn threads at sync time, send message, whatever.

In fibers, we have considered guard to be not primitive: fibers provides primitive CML, and if you want full CML, you can layer on top (for example by having a wrapper to perform-operation that might generate additional events).

Now back to withNack. Quoting the CML book §4.2.5:

This combinator behaves like guard, in that it takes a function whose evaluation is delayed until synchronization time, and returns it as an event value. The main difference is that at synchronization time, the guard function is applied to an abort event, which is enabled only if some other event is selected in the synchronization.

And then the example goes like this (translated):

(define (annotate-operation op nack-fn)
  (withNack
    (lambda (nack-op)
      (spawn-fiber (lambda ()
                     (nack-fn (perform-operation nack-op))))
      op)))

The idea is that when you make a nack-event, your nack guard function will be called on a fresh event (operation), as if from a wait-operation on a fresh nack condition (see (fibers conditions)). The nack guard function should return a "positive" event (operation). If some other op synchronizes instead of the positive event, the nack condition is signalled.

The fiber spawned by the nack guard function could then, say, send a message back to a remote server to release some kind of resource.

The corner cases come in for withNack on choice operations:

(choice-operation op-a (withNack (choice-operation op-b op-c) ...))

Here we want the nack condition to signal if op-a synchronizes, but not to signal if either of op-b or op-c synchronize.

In https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=9393e5ba6fa5cdcd981fee71c3bdfee8841c047d §5.2, Donnely and Fluet show how to implement withNack in terms of lower-level primitives. But, I do not fully understand it yet.

So, again, my questions:

  1. If withNack is already implementable with what we have, then what encoding would it have? What does this patch offer in addition?
  2. If withNack is not currently possible, will it be possible with this work?

Also, when we settle on the answer, can I request some documentation, please? Thank you :)

@civodul
Copy link
Collaborator Author

civodul commented Oct 10, 2024

Hey @wingo,

I'm afraid I cannot answer your questions, this being my first time hearing about guard and withNack, which I'm not sure to fully understand.

What I can say is that this change preserves choice-operation semantics, which is that only one sub-operation succeeds. If withNack were built on top of those semantics, I don't see how explicit cancellation could affect it.

If we cannot answer the question of whether cancellation functions would prevent withNack from being implemented, perhaps what we could do is to not expose them in make-base-operation. Instead we'd make them available through a make-base-operation/internal binding or similar, to avoid committing to the interface change. WDYT?

Regarding documentation, the commit updates the @defun entry for make-base-operation with the new argument, but note that make-base-operation is not actually documented, only mentioned. What would you suggest?

Thanks for your detailed review!

(put-message channel 'hello)
(loop (+ i 1))))))

(let ((initial-heap-size (heap-size)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're checking heap sizes, best do a gc beforehand, otherwise some heavy activity could lead to false negatives (as in 'no bug detected even though it exists').

Would it be possible (and sufficiently informative & meaningful) to instead check the length of the timer wheel? Seems less finicky to me (e.g. what if in the future tests are run in parallel in a single process).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if something finicky like heap sizes is avoided, I imagine the number of iterations could be reduced a lot (good for test performance).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that when #109 is present, the heap would grow way beyond the 2x limit that's tested here; it would not go unnoticed. (Maybe we could make the test faster but it was already reasonably fast in my experience.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How fast is 'reasonably fast'?

;; (sched) -> ()
(cancel-fn base-op-cancel-fn))

(define* (make-base-operation wrap-fn try-fn block-fn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ABI reasons, there needs to be a way to tell to Guile 'don't inline this' (not blocking this PR).

@LiberalArtist
Copy link

In https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=9393e5ba6fa5cdcd981fee71c3bdfee8841c047d §5.2, Donnely and Fluet show how to implement withNack in terms of lower-level primitives. But, I do not fully understand it yet.

So, again, my questions:

  1. If withNack is already implementable with what we have, then what encoding would it have? What does this patch offer in addition?

  2. If withNack is not currently possible, will it be possible with this work?

Also, when we settle on the answer, can I request some documentation, please? Thank you :)

I have just seen this issue and have only skimmed the paper you linked so far, but I wrote about the lack of withNack in #97.

In particular, I'd highlight § 7 of Flatt and Findler, “Kill-Safe Synchronization Abstractions” (PLDI 2004):

Recall that the event provided to a guard procedure by
nack-guard-evt becomes ready if the guard-generated event is
not chosen. MzScheme extends the Concurrent ML definition of
“not chosen” so that it includes all of the following cases, which
cover all of the ways that a thread can abandon an event:

  • The sync call chooses an event other than the one returned by
    the guard.
  • Control escapes from the sync call through an exception or
    continuation jump. The exception or jump may have been
    triggered through a break signal (discussed further in Sec-
    tion 8.2), by another guard involved in the same sync, or even
    by the guard procedure that received the NACK event. Con-
    tinuation jumps back into a guard are always blocked by our
    definition of nack-guard-evt, so multiple escapes are not
    possible.
  • The syncing thread terminates (i.e., it is suspended and un-
    reachable).

In the code from Figure 9, the event produced by
msg-queue-recv-evt can be used in an arbitrary client
context, so all of the above cases are possible.

MzScheme’s nack-guard-evt corresponds to Concurrent ML’s
withNack. An earlier version [19] of Concurrent ML offered
wrapAbort, instead, and a later presentation [21] explains how
withNack can be implemented with wrapAbort. Our definition
of “not chosen” does not allow such an implementation, and thus
strengthens the argument that withNack is the right operation to
designate as primitive.

I haven't read Donnely and Fluet closely enough (or re-read the implementation of wrapAbort from Concurrent Programming in ML) to know if it works with the Racket definition of “not chosen”, but in my experience Racket's definition of “not chosen” is useful, so it's something I'd want to know.

I wrote in #97 that I don't think make-base-operation is sufficient even to implement the guard combinator, and from a brief look I don't see how this PR would change that. (I hadn't considered what @wingo wrote about expecting guard to be implemented by a client abstraction that doesn't expose the underlying perform-operation, but that strikes me as unappealing in a similar way to providing an un-delimited call/cc and requiring client libraries to wrap it to implement delimited continuations. Part of the benefit of having a Concurrent ML library is that other libraries should be able to have operations in their APIs, and I think you loose expressive power if using guard or withNack would require a different abstraction.)

The mention of the fact in Flatt and Findler that “continuation jumps back into a guard are always blocked by our definition of nack-guard-evt” also reminded me of a bug report on the semantics of Guile's continuation barriers that I mostly wrote two years ago but never quite sent: I'll try to do that soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory leak on choice operation of 'get' and 'sleep'
4 participants