
TGI: Llama2: Counting input and generated tokens and tokens per second #1426

Answered by ansSanthoshM
ansSanthoshM asked this question in Q&A


Thanks @Narsil

I will ask the LangChain people about an option to get the complete server response and response headers when using HuggingFaceTextGenInference.

I can get the info I was looking for with a plain requests.post call. Thanks for the heads-up.

import requests
import time

# Request headers for the TGI /generate endpoint
headers = {
    "Content-Type": "application/json",
    "accept": "application/json",
}

# Ask the server for generation details; decoder_input_details
# additionally returns per-token details for the prompt
data = {
    "inputs": "What is Deep Learning?",
    "parameters": {
        "max_new_tokens": 20,
        "details": True,
        "decoder_input_details": True,
    },
}

start_time = time.time()
response = requests.post("http://0.0.0.0:8082/generate", headers=headers, json=data)
end_time = time.time()
time_taken = end_time - start_time

print(response.content)
print(response.json())
print(response.headers)
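
For the counting itself, here is a minimal sketch of how the numbers can be read out of the response above. It assumes the TGI /generate response carries a details object with a generated_tokens count and, because decoder_input_details was enabled, a prefill list with one entry per prompt token; tokens per second is then the generated count divided by the measured wall time.

# Sketch: derive token counts and throughput from the response above.
# Assumes the response JSON has the shape
# {"generated_text": ..., "details": {"generated_tokens": ..., "prefill": [...], ...}}.
result = response.json()
details = result["details"]

input_tokens = len(details["prefill"])          # one entry per prompt token
generated_tokens = details["generated_tokens"]  # number of new tokens

tokens_per_second = generated_tokens / time_taken
print(f"input tokens:     {input_tokens}")
print(f"generated tokens: {generated_tokens}")
print(f"tokens/second:    {tokens_per_second:.2f}")

Note that time_taken includes network and queueing overhead, so this rate understates the model's raw decode speed; for per-token timing the streaming endpoint is the better tool.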
