
Enable graph mode for LLM inference #89

Open
xduzhangjiayu opened this issue Jul 5, 2024 · 9 comments

Comments

@xduzhangjiayu
Contributor

xduzhangjiayu commented Jul 5, 2024

Hi,
I have read "examples\NPU compilation tutorial.ipynb" about graph mode and eager mode, which helped me a lot.
I was wondering if I could use graph mode in LLM inference to reduce the weight copying between the CPU and the NPU.
So I simply changed the return value of the function horizontal_fusion_linear to return fx_model.to('npu'). After converting the model, inference fails with:
AttributeError: 'Tensor' object has no attribute 'is_contiguous'
Does this mean the operation cannot be performed on the NPU? And if I want to use graph mode in LLM inference, is the above change the right approach?
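Roughly, the change amounts to the following (a simplified sketch: the real fusion logic inside horizontal_fusion_linear is omitted and its exact signature in the library may differ):

```python
import torch
from torch import fx


def horizontal_fusion_linear(model: torch.nn.Module) -> fx.GraphModule:
    # Trace the model; the real pass fuses parallel nn.Linear nodes here.
    fx_model = fx.symbolic_trace(model)
    fx_model.recompile()
    # The library returns fx_model as-is; the experiment moves the whole
    # traced graph to the NPU device before returning it.
    return fx_model.to("npu")
```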

Any comment or advice is appreciated, thanks!

@alessandropalla
Contributor

Hi,
We are working toward that as well. For example, please look at #84 for a tentative implementation of graph mode for the Phi3MLP layer. We are also waiting for the OpenVINO remote tensors feature, which would bring near performance parity between graph and kernel mode.
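For context, the kernel-mode path that works today looks roughly like this (following the pattern from the library README; the model name and dtype are only an example):

```python
import torch
import intel_npu_acceleration_library
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # any Hugging Face causal LM works
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Kernel mode: linear kernels are offloaded to the NPU, but weights are still
# copied to the device at inference time, which graph mode aims to avoid.
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

inputs = tokenizer("What is an NPU?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```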

@xduzhangjiayu
Contributor Author

xduzhangjiayu commented Jul 11, 2024 via email

@xduzhangjiayu
Contributor Author

xduzhangjiayu commented Jul 11, 2024 via email

@alessandropalla
Contributor

I think it depends on the implementation. We found that using the vanilla .to method doesn't produce quantized models with the right acceleration for the NPU, and we are working on it. The memory increase is due to this, plus the fact that the MLP is compiled once for the first inference and a second time for the n+1 inference, because the inputs have different shapes. This is why kernel mode is so important for LLM inference, and why remote tensors, which allow the weights to be allocated on the NPU ahead of time, are a crucial performance step in that direction.
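To illustrate the double compilation: a static-shape graph backend keys compiled graphs by input shape, so the prompt pass and the single-token passes each trigger their own compilation (compile_graph below is a stand-in for the real NPU lowering, not a library API):

```python
import torch

_compiled_graphs = {}  # compiled graph cache, keyed by input shape


def compile_graph(module: torch.nn.Module, shape: tuple) -> torch.nn.Module:
    print(f"compiling graph for input shape {shape}")
    return module  # stand-in: a real backend would lower to an NPU graph here


def run(module: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    key = tuple(x.shape)
    if key not in _compiled_graphs:  # cache miss -> compile for this shape
        _compiled_graphs[key] = compile_graph(module, key)
    return _compiled_graphs[key](x)


mlp = torch.nn.Sequential(
    torch.nn.Linear(3072, 8192), torch.nn.GELU(), torch.nn.Linear(8192, 3072)
)

run(mlp, torch.randn(1, 128, 3072))  # first inference (whole prompt): 1st compilation
run(mlp, torch.randn(1, 1, 3072))    # inference n+1 (single token): 2nd compilation
run(mlp, torch.randn(1, 1, 3072))    # subsequent tokens reuse the cached graph
```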

@alessandropalla
Contributor

You are welcome, I'm happy to help

@xduzhangjiayu
Contributor Author

xduzhangjiayu commented Jul 11, 2024 via email

@xduzhangjiayu
Contributor Author

Hi,
It seems the current version can optimize a separate Phi-3 MLP layer using to("npu"). I was curious whether we can use to("npu") only for the MLP layers when running inference on an entire LLM, to speed it up. Would there be any limitations to implementing this idea (e.g. in the OpenVINO backend or the NPU hardware)?
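A hedged sketch of that idea, assuming .to("npu") works on an individual Phi-3 MLP submodule as in the per-layer example, and that mixing CPU and NPU modules in one forward pass is supported, which is exactly the possible limitation being asked about (module paths follow the transformers Phi-3 implementation):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", torch_dtype=torch.float16
)

# Move only the MLP blocks to the NPU; attention, embeddings and the LM head
# stay on the CPU.
for layer in model.model.layers:
    layer.mlp = layer.mlp.to("npu")
```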

@alessandropalla
Contributor

We are working on this by using remote tensors (WIP PR here: #97). That would help remove all the overhead. The end goal is to use .to('npu') the same way you use CUDA to move tensors and models to the NPU.
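In other words, the intended end state (not something that runs today; the remote tensors work in #97 is a step toward it) would mirror the familiar CUDA workflow:

```python
import torch

# Hypothetical usage once remote tensors land: weights allocated once on the NPU...
model = torch.nn.Linear(3072, 3072).to("npu")
# ...and inputs moved on demand, exactly like .to("cuda").
x = torch.randn(1, 3072).to("npu")
y = model(x)
```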

@xduzhangjiayu
Contributor Author

It would be great if we could load the entire model onto the NPU by using remote tensors. Thanks for the reply!
