Does HF TGI support a Multi-Node Multi-GPU server setup? #1561
Unanswered
ansSanthoshM asked this question in Q&A
Replies: 3 comments · 6 replies
-
Please let me know the developers' comments on this. It would help me decide my next course of action.
-
Hi! We would like to know the same!
-
I have it working on a single node with multiple GPUs. For scaling, I'm planning to run a load balancer over multiple instances.
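For illustration, here is a minimal client-side sketch of that pattern, assuming several independently launched single-node TGI instances (the `node-a`/`node-b` host names and ports are placeholders, not anything TGI provides). A production setup would more likely put nginx, HAProxy, or a Kubernetes Service in front instead:

```python
from itertools import cycle

from huggingface_hub import InferenceClient

# One TGI instance per node; each instance is itself sharded across its
# local GPUs. Host names and ports below are placeholders.
endpoints = cycle([
    InferenceClient(model="http://node-a:8080"),
    InferenceClient(model="http://node-b:8080"),
])

def generate(prompt: str) -> str:
    # Round-robin: each call goes to the next instance in the rotation.
    client = next(endpoints)
    return client.text_generation(prompt, max_new_tokens=64)

print(generate("What does --num-shard do in TGI?"))
```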
-
Hi Team,
I have two machines; each machine has 4 NVIDIA GPUs with 46GB of VRAM each, so each machine has 184GB of VRAM.
The two machines are set up as a cluster, so the cluster has 8 GPUs and 368GB of VRAM in total.
I want to load two LLMs on this cluster: 1) Llama2-70B-Chat and 2) Llama2-70B-Code. Each of these LLMs consumes 168GB of VRAM, so loading both requires 336GB of VRAM in total. I am therefore thinking of a multi-node multi-GPU server configuration, i.e. 2 nodes with 4 GPUs each.
Is it possible to set up a TGI server on this cluster configuration, so that I can create two Docker container endpoints, one per LLM, with both sharing the same hardware?
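For what it's worth, since each model (168GB) fits within a single node's 184GB, one possible layout avoids cross-node sharding entirely: run one TGI container per node, sharded over that node's 4 local GPUs via `--num-shard 4`, and expose the two models as two separate endpoints. A minimal sketch under that assumption (host names, ports, and the exact model ids are placeholders, not taken from the thread):

```python
# Assumed launch, one container per node (placeholder model ids and ports):
#   node-1: docker run --gpus all -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference \
#       --model-id meta-llama/Llama-2-70b-chat-hf --num-shard 4
#   node-2: docker run --gpus all -p 8080:80 \
#       ghcr.io/huggingface/text-generation-inference \
#       --model-id codellama/CodeLlama-70b-hf --num-shard 4
#
# Each model then lives entirely on one node, so no cross-node tensor
# parallelism is needed; clients simply pick the endpoint they want.
from huggingface_hub import InferenceClient

chat_client = InferenceClient(model="http://node-1:8080")  # Llama2-70B-Chat
code_client = InferenceClient(model="http://node-2:8080")  # Llama2-70B-Code

def ask(prompt: str, task: str = "chat") -> str:
    # Route code prompts to the code model, everything else to chat.
    client = code_client if task == "code" else chat_client
    return client.text_generation(prompt, max_new_tokens=128)

print(ask("Write a Python function that reverses a string.", task="code"))
```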