Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/CUDA_COPY: Fix ep_flush logic; Remove sync_streams usage #10493

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Akshay-Venkatesh
Copy link
Contributor

What?

  • Fix ep_flush logic to return IN_PROGRESS if there are outstanding events in event queue.
  • Remove uct_cuda_copy_sync_streams because it is unnecessary. Iface/EP flush operations are allowed to return IN_PROGRESS if there are outstanding events and user of the transport will call progress

@yosefe
Copy link
Contributor

yosefe commented Feb 16, 2025

@Akshay-Venkatesh is it really enough to return OK when there areno outstanding operations on the stream? AFAIU, it does not guarantee the data is really copied yet?
also , when returning INPROGRESS, and comp!=NULL, need to call the completion callback when the ep is flushed
see also this NVBug

@Akshay-Venkatesh
Copy link
Contributor Author

@Akshay-Venkatesh is it really enough to return OK when there areno outstanding operations on the stream? AFAIU, it does not guarantee the data is really copied yet?

@yosefe Every operation posted on a stream is recorded on the stream. If the events are not present on event queues then operations are complete from the initiator's perspective. For the implementation perspective of ep/iface flush, if event queues are empty, it should be correct to return OK. Do you have a case in mind that would break with the changes in this PR?

also , when returning INPROGRESS, and comp!=NULL, need to call the completion callback when the ep is flushed
see also this NVBug

Will address this.

@yosefe
Copy link
Contributor

yosefe commented Feb 18, 2025

@yosefe Every operation posted on a stream is recorded on the stream. If the events are not present on event queues then operations are complete from the initiator's perspective. For the implementation perspective of ep/iface flush, if event queues are empty, it should be correct to return OK. Do you have a case in mind that would break with the changes in this PR?

Flush operation on UCT should guarantee that operation is also complete from the target side

@Akshay-Venkatesh
Copy link
Contributor Author

Flush operation on UCT should guarantee that operation is also complete from the target side

Confirmed that this is true. Copy operations when completed (through explicit synchronize operation or through event query), guarantee completion at both initiator and target.

I see the following problem invoking completion if comp!=NULL with the below diff:

diff --git a/src/uct/cuda/cuda_copy/cuda_copy_iface.c b/src/uct/cuda/cuda_copy/cuda_copy_iface.c
index 53d8acaef..082bd75ab 100644
--- a/src/uct/cuda/cuda_copy/cuda_copy_iface.c
+++ b/src/uct/cuda/cuda_copy/cuda_copy_iface.c
@@ -294,6 +294,10 @@ uct_cuda_copy_ep_flush(uct_ep_h tl_ep, unsigned flags, uct_completion_t *comp)
         }
     }

+    if (comp != NULL) {
+        uct_invoke_completion(comp, UCS_OK);
+    }
+
     return uct_base_ep_flush(tl_ep, flags, comp);
 }
[07:655137:0:655137]   uct_iface.h:970  Assertion `comp->count > 0' failed: comp=0xc13e18855418 (ucp_ep_flush_completion) count=0 status=0
==== backtrace (tid: 655137) ====
 0  $UCX_HOME/lib/libucs.so.0(ucs_handle_error+0x2cc) [0xf147d0c76f3c]
 1  $UCX_HOME/lib/libucs.so.0(ucs_fatal_error_message+0xe0) [0xf147d0c74310]
 2  $UCX_HOME/lib/libucs.so.0(ucs_fatal_error_format+0xfc) [0xf147d0c74410]
 3  $UCX_HOME/lib/ucx/libuct_cuda.so.0(+0xb324) [0xf147d09bb324]
 4  $UCX_HOME/lib/libucp.so.0(+0x9ea74) [0xf147d0a9ea74]
 5  $UCX_HOME/lib/libucp.so.0(ucp_ep_flush_completion+0xb0) [0xf147d0a9f130]
 6  $UCX_HOME/lib/ucx/libuct_cuda.so.0(+0xb270) [0xf147d09bb270]
 7  $UCX_HOME/lib/libucp.so.0(+0x9ea74) [0xf147d0a9ea74]
 8  $UCX_HOME/lib/libucp.so.0(ucp_ep_flush_internal+0x1bc) [0xf147d0aa006c]
 9  $UCX_HOME/lib/libucp.so.0(ucp_ep_close_nbx+0x294) [0xf147d0a4a748]
10 $UCX_HOME/lib/libucp.so.0(ucp_ep_close_nb+0x44) [0xf147d0a4aa44]

when returning INPROGRESS, and comp!=NULL, need to call the completion callback when the ep is flushed

Also, regarding your comment about calling completion callback when returning INPROGRESS, looking at semantics of ep_flush, we are required to invoke completion callback if return is not UCS_INPROGRESS but if it is UCS_OK, right? At any rate, I'd appreciate if you feedback on how completion ought to be invoked if all operations are flushed.

 * @return UCS_OK              - No outstanding communications left.
 *         UCS_ERR_NO_RESOURCE - Flush operation could not be initiated. A subsequent
 *                               call to @ref uct_ep_pending_add would add a pending
 *                               operation, which provides an opportunity to retry
 *                               the flush.
 *         UCS_INPROGRESS      - Some communication operations are still in progress.
 *                               If non-NULL 'comp' is provided, it will be updated
 *                               upon completion of these operations.
 */
UCT_INLINE_API ucs_status_t uct_ep_flush(uct_ep_h ep, unsigned flags,
                                         uct_completion_t *comp)
{
    return ep->iface->ops.ep_flush(ep, flags, comp);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants