TGI: Llama2: Counting input and generated tokens and tokens per second #1426
-
I am using TGI for the Llama2 70B model as below. Is there any way to get the number of tokens in the input and output text, and also the tokens per second (this is available in the Docker container LLM server output), from this Python code? Is there any way to call tokenize from TGI?
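A minimal sketch of this kind of client call, assuming TGI's `text_generation` Python client and a placeholder endpoint URL:

```python
# Minimal sketch, assuming TGI's `text_generation` client;
# the endpoint URL and max_new_tokens are placeholders.
from text_generation import Client

client = Client("http://localhost:8080")  # placeholder TGI endpoint

response = client.generate("What is capital of Kenya?", max_new_tokens=64)
print(response.generated_text)
```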
Generated output:
-
You can use the tok/s figures that are in the response headers. Although, if you really care about these numbers, you should most likely use our Prometheus metrics.
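A sketch of reading those headers, assuming a local TGI endpoint; the exact header names can vary by TGI version, so this just prints whatever `x-*` headers the server returns:

```python
# Sketch: inspect the x-* response headers TGI attaches to /generate responses.
# The endpoint URL is a placeholder; header names vary by TGI version.
import requests

resp = requests.post(
    "http://localhost:8080/generate",  # placeholder TGI endpoint
    json={"inputs": "What is capital of Kenya?", "parameters": {"max_new_tokens": 64}},
)

for name, value in resp.headers.items():
    if name.lower().startswith("x-"):
        print(name, "=", value)  # timing/throughput info lives in these headers
```

For monitoring over time, the server also exposes a Prometheus endpoint (a `GET /metrics` route on the serving port), which is what the reply above recommends scraping instead of per-request headers.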
-
I can get the number of tokens in the input by setting `decode_input_details=True` in `client.text_generation`, as below.

LLM Answer: What is capital of Kenya?

When I use it with `HuggingFaceTextGenInference`, I am getting a warning:

```
WARNING! decode_input_details is not default parameter.
```

```python
llm = HuggingFaceTextGenInference(...)
output = llm("What is capital of Germany", ...)
```
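For comparison, a sketch that bypasses the LangChain wrapper (which, per the warning above, does not appear to forward that parameter) and requests input-token details directly. This assumes `huggingface_hub`'s `InferenceClient`, where the parameter is spelled `decoder_input_details`, and a placeholder URL:

```python
# Sketch: request prefill (input) token details directly from TGI,
# assuming huggingface_hub's InferenceClient; the URL is a placeholder.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder TGI endpoint

out = client.text_generation(
    "What is capital of Germany?",
    max_new_tokens=64,
    details=True,                # include generation details in the response
    decoder_input_details=True,  # include the tokenized prompt (prefill)
)

print("input tokens:", len(out.details.prefill))
print("generated tokens:", out.details.generated_tokens)
```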
Thanks @Narsil
I will ask the LangChain people about an option to get the complete server response and response headers using HuggingFaceTextGenInference.
I can get the info that I was looking for using the requests.post method. Thanks for the heads up.
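A sketch of that requests.post approach, with a placeholder endpoint: when `"details": true` is set in the parameters, the JSON body itself carries the token counts, and a client-side timer gives an approximate tokens-per-second figure.

```python
# Sketch: call TGI's /generate route directly; the URL is a placeholder.
import requests
import time

t0 = time.perf_counter()
resp = requests.post(
    "http://localhost:8080/generate",  # placeholder TGI endpoint
    json={
        "inputs": "What is capital of Kenya?",
        "parameters": {
            "max_new_tokens": 64,
            "details": True,                # include generation details in the body
            "decoder_input_details": True,  # include the tokenized prompt (prefill)
        },
    },
)
elapsed = time.perf_counter() - t0

details = resp.json()["details"]
n_in = len(details["prefill"])
n_out = details["generated_tokens"]
# Note: the rate is approximate, since `elapsed` includes network and queue time.
print(f"{n_in} input tokens, {n_out} generated tokens, ~{n_out / elapsed:.1f} tok/s")
```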