
How to Obtain GPU Device Memory Values Directly in TensorRT 10.3? #4096

Closed
ryukh4520 opened this issue Aug 26, 2024 · 2 comments

Comments

@ryukh4520

Description

Hello,

I have a question regarding the handling of GPU device memory in TensorRT 10.3. Here is the situation I'm facing:

Context:

  • TensorRT 8: With execute_async_v2, we could run inference entirely on the GPU. By passing the device pointers of the input and output tensors as bindings, everything, including the inference results, stayed in GPU memory.

  • TensorRT 10: With execute_async_v3, bindings can no longer be passed that way. We now set up host/device memory for the inputs and outputs and use explicit cuda.memcpy calls to transfer data between them.

  • Current Process: For input tensors, we can bind the device pointer of an existing GPU tensor directly, so inference proceeds without copying data back and forth between host and device memory. For the output, however, we currently perform a device-to-host copy to retrieve the inference results. The allocation code is shown below, followed by a sketch of the inference step.

Allocate input / output memory:

```python
# Allocate a page-locked host buffer and a device buffer for every I/O tensor.
for i in range(self.engine.num_io_tensors):
    tensor_name = self.engine.get_tensor_name(i)
    tensor_shape = self.engine.get_tensor_shape(tensor_name)

    size = trt.volume(tensor_shape)
    dtype = trt.nptype(self.engine.get_tensor_dtype(tensor_name))

    host_mem = cuda.pagelocked_empty(size, dtype=dtype)  # pinned host buffer
    device_mem = cuda.mem_alloc(host_mem.nbytes)         # device buffer

    self.bindings.append(int(device_mem))

    if self.engine.get_tensor_mode(tensor_name) == trt.TensorIOMode.INPUT:
        self.inputs.append({'name': tensor_name, 'host': host_mem, 'device': device_mem})
    else:
        self.outputs.append({'name': tensor_name, 'host': host_mem, 'device': device_mem,
                             'shape': tensor_shape})

# Point the first input binding directly at an existing GPU tensor,
# so no host-to-device copy is needed for the input.
self.inputs[0]["device"] = int(some_tensor.data_ptr())
```
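For reference, the inference step in TensorRT 10 then registers these pointers by tensor name and launches with execute_async_v3. The sketch below is only illustrative and assumes `self.context` is an IExecutionContext and `self.stream` is a pycuda Stream (neither appears in the snippet above):

```python
# Sketch of the TensorRT 10 inference step using the buffers allocated above.
import pycuda.driver as cuda

def infer(self):
    for inp in self.inputs:
        # In the setup above, the first input already points at a GPU tensor,
        # so its host->device copy can be skipped.
        cuda.memcpy_htod_async(inp['device'], inp['host'], self.stream)
        self.context.set_tensor_address(inp['name'], int(inp['device']))

    for out in self.outputs:
        self.context.set_tensor_address(out['name'], int(out['device']))

    # Launch inference asynchronously on the stream.
    self.context.execute_async_v3(stream_handle=self.stream.handle)

    # Device -> host copy of the results: the extra step the question asks to avoid.
    for out in self.outputs:
        cuda.memcpy_dtoh_async(out['host'], out['device'], self.stream)

    self.stream.synchronize()
```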

Question:
Is there a way in TensorRT 10.3 to obtain the inference results directly using the device memory pointer, without the need for an additional cuda.memcpy operation to transfer the data back to host memory? Essentially, I'm looking for a method to access the results directly from the GPU device memory.

Thank you for your assistance!

Environment

  • torch: 2.0.1
  • numpy: 1.26.4
  • python: 3.10.14

TensorRT Version:

  • tensorrt: 10.3.0
  • tensorrt-cu12: 10.3.0
  • tensorrt-cu12-bindings: 10.3.0
  • tensorrt-cu12-libs: 10.3.0

NVIDIA GPU:

  • Tesla V100-PCIE-32GB

NVIDIA Driver Version:

  • 535.183.01

CUDA Version:

  • 12.2

Operating System:

  • Ubuntu
@ryukh4520
Author

It works the same as for the inputs: just assign the data_ptr address to outputs[0]["device"].

```python
# make an empty tensor for the output
output_shape = [3, 224, 224]
output_tensor = torch.empty([3 * 224 * 224], dtype=somedtype, device="cuda")

outputs[0]["device"] = int(output_tensor.data_ptr())

# after inference, the result values are written into output_tensor
```
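For completeness, a fully device-resident version of this flow, with both input and output bound to torch tensor pointers and no cuda.memcpy at all, might look like the sketch below. It reuses the illustrative names from above (self.context, self.inputs, self.outputs, some_tensor) and assumes a float32 output of shape [3, 224, 224]:

```python
# Sketch: zero-copy inference with both input and output kept on the GPU.
# self.context, self.inputs, self.outputs, some_tensor and the shape/dtype
# are assumptions carried over from the snippets above.
import torch

input_tensor = some_tensor.contiguous()  # existing GPU tensor used as network input
output_tensor = torch.empty(3 * 224 * 224, dtype=torch.float32, device="cuda")

# Bind the raw device pointers by tensor name.
self.context.set_tensor_address(self.inputs[0]['name'], int(input_tensor.data_ptr()))
self.context.set_tensor_address(self.outputs[0]['name'], int(output_tensor.data_ptr()))

# Run inference on the current torch CUDA stream; the results are written
# straight into output_tensor, so no device-to-host copy is needed.
stream = torch.cuda.current_stream()
self.context.execute_async_v3(stream_handle=stream.cuda_stream)
stream.synchronize()

result = output_tensor.view(3, 224, 224)  # reshape on the GPU
```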

@jinhonglu

#4330

I'm facing a problem: when I bind the pointer of a CUDA memory buffer directly, I get a different result than when I copy the input data in from a numpy array.

Did you face a similar problem?
