Regarding llama3-70b-instruct #1864

chintanshrinath · 2024-05-06T11:13:18Z

Hello,
I am trying to load the full model on a machine with 8 A100 80 GB GPUs, using the command below.
docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --max-input-length 8000 --max-total-tokens 8010

However, it is not using all of the GPUs.
I also looked at num_shard, but did not understand how to use it.

Can you help me use all of the GPUs and optimize the above command? Our main concern is reducing inference time for production use.
Thanks

github-actions · 2024-06-06T01:47:29Z

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label Jun 6, 2024
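
For reference, the TGI launcher accepts a `--num-shard` option that sets how many shards (GPUs) the model is split across. The sketch below is the command from the question with that option added. It assumes all 8 GPUs on the host should be used and that `$token`, `$volume`, and `$model` are set as in the original command. The `num_shard` and `cmd` shell variables are introduced here for illustration only. The script prints the command for review instead of launching the container, and whether this setting reduces inference time for a given workload has not been measured here.

```shell
# Number of GPUs to shard the model across (assumption: all 8 GPUs on the host).
num_shard=8

# Same command as in the question, with --num-shard added.
cmd="docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id $model \
  --num-shard $num_shard \
  --max-input-length 8000 --max-total-tokens 8010"

# Print the assembled command for review; to launch it, run: eval "$cmd"
echo "$cmd"
```

While the container is running, `nvidia-smi` on the host shows whether memory is allocated on every GPU.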
