Ability to retrieve the protocol response headers in InferenceClient #2281

Closed
fxmarty opened this issue May 14, 2024 · 4 comments

fxmarty commented May 14, 2024

As per the title, it would be helpful to be able to retrieve the response headers, as is possible with curl --include.

Example of useful response headers:

(base) felix@azure-amd-mi300-dev-01:~$ curl 0.0.0.0:80/generate -X POST -d '{"inputs":"Today I am in Paris and","parameters":{"max_new_tokens": 3, "details": true}}' -H 'Content-Type: application/json' --include
HTTP/1.1 200 OK
content-type: application/json
x-compute-type: gpu+optimized
x-compute-time: 0.111191439
x-compute-characters: 23
x-total-time: 111
x-validation-time: 0
x-queue-time: 0
x-inference-time: 110
x-time-per-token: 36
x-prompt-tokens: 7
x-generated-tokens: 3
content-length: 318
access-control-allow-origin: *
vary: origin
vary: access-control-request-method
vary: access-control-request-headers
date: Tue, 14 May 2024 12:57:59 GMT
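
For reference, a minimal Python equivalent of the curl call above, using requests directly against a local TGI instance (same URL and payload as in the example; not something InferenceClient exposes today):

```python
import requests

# Same request as the curl example above, against a local TGI instance.
resp = requests.post(
    "http://0.0.0.0:80/generate",
    json={
        "inputs": "Today I am in Paris and",
        "parameters": {"max_new_tokens": 3, "details": True},
    },
)
resp.raise_for_status()

# The x-* headers carry the compute/timing stats shown above.
for name, value in resp.headers.items():
    if name.lower().startswith("x-"):
        print(f"{name}: {value}")
```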

Wauplin commented May 21, 2024

Hi @fxmarty, thanks for the feature request. Do you have a suggestion for how this information could be returned in the current InferenceClient framework? Open to ideas.


Wauplin commented Jun 11, 2024

I'm closing this issue since no new details have been provided. @fxmarty, happy to reopen it if you want. Just let me know what your use case for such a feature would be, so that we can figure out the best way to support it.

@Wauplin Wauplin closed this as not planned Jun 11, 2024

fxmarty commented Jun 12, 2024

@Wauplin the feature this enables is tracking the response time of TGI from the client side, with stats like:

x-compute-type: gpu+optimized
x-compute-time: 0.111191439
x-compute-characters: 23
x-total-time: 111
x-validation-time: 0
x-queue-time: 0
x-inference-time: 110
x-time-per-token: 36
x-prompt-tokens: 7
x-generated-tokens: 3

For example, in https://huggingface.co/spaces/fxmarty/tgi-mi300-demo-chat/blob/main/app.py, I wanted to use client.text_generation and surface these stats to the user as well, but I couldn't without calling the REST API myself. Note that TGI does not return these stats on the generate_stream endpoint.
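
A rough sketch of the workaround I mean, calling the /generate REST endpoint directly with requests instead of client.text_generation (the endpoint URL is just an example of a local TGI instance):

```python
import requests

TGI_URL = "http://0.0.0.0:80/generate"  # example local TGI endpoint

def generate_with_stats(prompt: str, max_new_tokens: int = 3):
    """Call TGI's /generate endpoint directly and return the text plus the x-* timing headers."""
    resp = requests.post(
        TGI_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens, "details": True}},
    )
    resp.raise_for_status()
    stats = {k: v for k, v in resp.headers.items() if k.lower().startswith("x-")}
    return resp.json()["generated_text"], stats

text, stats = generate_with_stats("Today I am in Paris and")
print(text)
print(f"total time: {stats.get('x-total-time')} ms, per token: {stats.get('x-time-per-token')} ms")
```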


Wauplin commented Jun 12, 2024

@fxmarty thanks for the explanation! Do you have a suggestion for how you would like this information to be returned in the current InferenceClient framework?
