-
Notifications
You must be signed in to change notification settings - Fork 436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/CUDA_COPY: Fix ep_flush logic; Remove sync_streams usage #10493
base: master
Are you sure you want to change the base?
UCT/CUDA_COPY: Fix ep_flush logic; Remove sync_streams usage #10493
Conversation
@Akshay-Venkatesh is it really enough to return OK when there areno outstanding operations on the stream? AFAIU, it does not guarantee the data is really copied yet? |
@yosefe Every operation posted on a stream is recorded on the stream. If the events are not present on event queues then operations are complete from the initiator's perspective. For the implementation perspective of ep/iface flush, if event queues are empty, it should be correct to return OK. Do you have a case in mind that would break with the changes in this PR?
Will address this. |
Flush operation on UCT should guarantee that operation is also complete from the target side |
Confirmed that this is true. Copy operations when completed (through explicit synchronize operation or through event query), guarantee completion at both initiator and target. I see the following problem invoking completion if comp!=NULL with the below diff: diff --git a/src/uct/cuda/cuda_copy/cuda_copy_iface.c b/src/uct/cuda/cuda_copy/cuda_copy_iface.c
index 53d8acaef..082bd75ab 100644
--- a/src/uct/cuda/cuda_copy/cuda_copy_iface.c
+++ b/src/uct/cuda/cuda_copy/cuda_copy_iface.c
@@ -294,6 +294,10 @@ uct_cuda_copy_ep_flush(uct_ep_h tl_ep, unsigned flags, uct_completion_t *comp)
}
}
+ if (comp != NULL) {
+ uct_invoke_completion(comp, UCS_OK);
+ }
+
return uct_base_ep_flush(tl_ep, flags, comp);
} [07:655137:0:655137] uct_iface.h:970 Assertion `comp->count > 0' failed: comp=0xc13e18855418 (ucp_ep_flush_completion) count=0 status=0
==== backtrace (tid: 655137) ====
0 $UCX_HOME/lib/libucs.so.0(ucs_handle_error+0x2cc) [0xf147d0c76f3c]
1 $UCX_HOME/lib/libucs.so.0(ucs_fatal_error_message+0xe0) [0xf147d0c74310]
2 $UCX_HOME/lib/libucs.so.0(ucs_fatal_error_format+0xfc) [0xf147d0c74410]
3 $UCX_HOME/lib/ucx/libuct_cuda.so.0(+0xb324) [0xf147d09bb324]
4 $UCX_HOME/lib/libucp.so.0(+0x9ea74) [0xf147d0a9ea74]
5 $UCX_HOME/lib/libucp.so.0(ucp_ep_flush_completion+0xb0) [0xf147d0a9f130]
6 $UCX_HOME/lib/ucx/libuct_cuda.so.0(+0xb270) [0xf147d09bb270]
7 $UCX_HOME/lib/libucp.so.0(+0x9ea74) [0xf147d0a9ea74]
8 $UCX_HOME/lib/libucp.so.0(ucp_ep_flush_internal+0x1bc) [0xf147d0aa006c]
9 $UCX_HOME/lib/libucp.so.0(ucp_ep_close_nbx+0x294) [0xf147d0a4a748]
10 $UCX_HOME/lib/libucp.so.0(ucp_ep_close_nb+0x44) [0xf147d0a4aa44]
Also, regarding your comment about calling completion callback when returning INPROGRESS, looking at semantics of ep_flush, we are required to invoke completion callback if return is not UCS_INPROGRESS but if it is UCS_OK, right? At any rate, I'd appreciate if you feedback on how completion ought to be invoked if all operations are flushed. * @return UCS_OK - No outstanding communications left.
* UCS_ERR_NO_RESOURCE - Flush operation could not be initiated. A subsequent
* call to @ref uct_ep_pending_add would add a pending
* operation, which provides an opportunity to retry
* the flush.
* UCS_INPROGRESS - Some communication operations are still in progress.
* If non-NULL 'comp' is provided, it will be updated
* upon completion of these operations.
*/
UCT_INLINE_API ucs_status_t uct_ep_flush(uct_ep_h ep, unsigned flags,
uct_completion_t *comp)
{
return ep->iface->ops.ep_flush(ep, flags, comp);
} |
What?