TGI: Llama2: Counting input and generated tokens and tokens per second #1426
-
I am using TGI for the Llama2 70B model as below. Is there any way to get the number of tokens in the input and output text, and also the tokens per second (this is available in the Docker container LLM server output), from this Python code? Is there any way to call tokenize from TGI?
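A minimal sketch of this kind of client call, assuming TGI's `text_generation` Python client and a placeholder endpoint URL:

```python
# Minimal sketch, assuming TGI's `text_generation` client;
# the endpoint URL and max_new_tokens are placeholders.
from text_generation import Client

client = Client("http://localhost:8080")  # placeholder TGI endpoint

response = client.generate("What is capital of Kenya?", max_new_tokens=64)
print(response.generated_text)
```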
Generated output:
-
You can use the tok/s figures that are in the response headers. Although, if you really care about these numbers, you should most likely use our Prometheus metrics.
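A sketch of reading those headers, assuming a local TGI endpoint; the exact header names can vary by TGI version, so this just prints whatever `x-*` headers the server returns:

```python
# Sketch: inspect the x-* response headers TGI attaches to /generate responses.
# The endpoint URL is a placeholder; header names vary by TGI version.
import requests

resp = requests.post(
    "http://localhost:8080/generate",  # placeholder TGI endpoint
    json={"inputs": "What is capital of Kenya?", "parameters": {"max_new_tokens": 64}},
)

for name, value in resp.headers.items():
    if name.lower().startswith("x-"):
        print(name, "=", value)  # timing/throughput info lives in these headers
```

For monitoring over time, the server also exposes a Prometheus endpoint (a `GET /metrics` route on the serving port), which is what the reply above recommends scraping instead of per-request headers.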
-
I can get the number of tokens in the input by setting `decode_input_details=True` in `client.text_generation`, as below.

LLM Answer: What is capital of Kenya?

When I use it with `HuggingFaceTextGenInference`, I am getting a warning:

```
WARNING! decode_input_details is not default parameter.
```

```python
llm = HuggingFaceTextGenInference(...)
output = llm("What is capital of Germany", ...)
```
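For comparison, a sketch that bypasses the LangChain wrapper (which, per the warning above, does not appear to forward that parameter) and requests input-token details directly. This assumes `huggingface_hub`'s `InferenceClient`, where the parameter is spelled `decoder_input_details`, and a placeholder URL:

```python
# Sketch: request prefill (input) token details directly from TGI,
# assuming huggingface_hub's InferenceClient; the URL is a placeholder.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder TGI endpoint

out = client.text_generation(
    "What is capital of Germany?",
    max_new_tokens=64,
    details=True,                # include generation details in the response
    decoder_input_details=True,  # include the tokenized prompt (prefill)
)

print("input tokens:", len(out.details.prefill))
print("generated tokens:", out.details.generated_tokens)
```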
Thanks @Narsil
I will ask the LangChain people about an option to get the complete server response and response headers using HuggingFaceTextGenInference.
I can get the info that I was looking for using the requests.post method. Thanks for the heads up.
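A sketch of that requests.post approach, with a placeholder endpoint: when `"details": true` is set in the parameters, the JSON body itself carries the token counts, and a client-side timer gives an approximate tokens-per-second figure.

```python
# Sketch: call TGI's /generate route directly; the URL is a placeholder.
import requests
import time

t0 = time.perf_counter()
resp = requests.post(
    "http://localhost:8080/generate",  # placeholder TGI endpoint
    json={
        "inputs": "What is capital of Kenya?",
        "parameters": {
            "max_new_tokens": 64,
            "details": True,                # include generation details in the body
            "decoder_input_details": True,  # include the tokenized prompt (prefill)
        },
    },
)
elapsed = time.perf_counter() - t0

details = resp.json()["details"]
n_in = len(details["prefill"])
n_out = details["generated_tokens"]
# Note: the rate is approximate, since `elapsed` includes network and queue time.
print(f"{n_in} input tokens, {n_out} generated tokens, ~{n_out / elapsed:.1f} tok/s")
```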