Enable graph mode for LLM inference #89
Hi,
Thanks very much for the reply!

> ---- Replied Message ----
> From: Alessandro
> Subject: Re: [intel/intel-npu-acceleration-library] Enable graph mode for LLM inference (Issue #89)
>
> Hi,
> we are working toward that as well. For example, please look at #84 for a tentative implementation of it for the Phi3MLP layer. We are also waiting for the OpenVINO remote tensors feature, which would bring near performance parity between graph and kernel mode.
By the way, I've just tried the graph mode implementation for TinyLlama-1.1B INT4, changing only "Phi3MLP" to "LlamaMLP".
I found that the inference speed was not improved (a warm-up was already done), and during inference the NPU memory usage went up to 4 GB, which I think is impossible for this case. Do you have any comment on this?
Finally, thanks a lot for your reply and patience. I've learned a lot from this project!
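For reference, a minimal sketch of the one-line swap being described. The helper name `matches_graph_mode_mlp` is hypothetical; the real matching happens inside the library's compile path (see the tentative Phi3MLP support from #84):

```python
import torch.nn as nn

# Hypothetical stand-in for the library's module-matching check;
# the actual logic lives in the library's compile path.
def matches_graph_mode_mlp(module: nn.Module) -> bool:
    # Tentative Phi3 support from #84:
    # return module.__class__.__name__ == "Phi3MLP"
    # Changed so TinyLlama's MLP blocks are picked up instead:
    return module.__class__.__name__ == "LlamaMLP"
```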
I think it depends on the implementation. We found that using the vanilla .to method doesn't produce quantized models with the right acceleration for the NPU, and we are working on it. The memory increase is due to this, plus the fact that the MLPs are compiled once for the first inference and another time for the n+1 inference, because the input has a different shape. This is why kernel mode is so important for LLM inference, and why remote tensors, which allow the weights to be already allocated on the NPU, are a crucial performance step in that direction.
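To illustrate the shape difference (an assumption about what "different shape" refers to here, consistent with how LLM generation usually works): the first forward pass (prefill) processes the whole prompt at once, while every later pass (decode) processes a single new token, so a graph compiled for one input shape cannot be reused for the other.

```python
import torch

hidden_size = 2048  # TinyLlama-1.1B hidden size
prompt_len = 128    # example prompt length

# First inference (prefill): the MLP sees the whole prompt at once.
prefill_input = torch.randn(1, prompt_len, hidden_size)  # (1, 128, 2048)

# n+1 inference (decode): only the newly generated token goes through.
decode_input = torch.randn(1, 1, hidden_size)            # (1, 1, 2048)

# A graph compiled for (1, 128, 2048) cannot serve (1, 1, 2048), so the
# MLP is compiled a second time, which accounts for the extra NPU memory.
```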
You are welcome, I'm happy to help.
Sorry, I don't understand the "different shape" you mentioned. I think the dimensions and weights of the MLP layer are the same during inference? So after the warm-up, can the weights already be allocated on the NPU?
Hi,
We are working on this by using remote tensors (WIP PR here: #97). That would help in removing all overhead. The end goal would be to use …
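A conceptual sketch of what remote tensors buy here, modeled on OpenVINO's remote-tensor interface for other devices. Since NPU remote tensors were still WIP at the time (#97), the "NPU" device string and the calls below are assumptions, not a confirmed API:

```python
import openvino as ov

core = ov.Core()

# A remote context ties tensor allocations to device memory; weights
# created through it stay resident on the device instead of being
# copied from the host on every inference call.
# NOTE: "NPU" is an assumption; this mirrors the interface OpenVINO
# exposes for other devices while NPU support was WIP (#97).
context = core.get_default_context("NPU")
remote_weights = context.create_tensor(ov.Type.f16, ov.Shape([2048, 2048]), {})
```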
That will be great if we can load the entire model into the NPU by using remote tensors. Thanks for the reply!
Hi,
I have read the "examples\NPU compilation tutorial.ipynb" about graph mode and eager mode, which helped me a lot.
I was wondering if I could use graph mode in LLM inference to reduce the weight copying between CPU and NPU.
So I simply changed the return value of the function `horizontal_fusion_linear` into `return fx_model.to('npu')`. After converting the model, the inference error is: `AttributeError: 'Tensor' object has no attribute 'is_contiguous'`.
It seems this operation cannot be performed on the NPU? If I want to use graph mode in LLM inference, is the above change correct?
Any comment or advice is appreciated, thanks!
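For context, a paraphrased sketch of the change being described (not the library's actual source). It assumes `horizontal_fusion_linear` is the torch.fx-based fusion pass from the library's optimizations module, with only the final return line altered as in the post:

```python
import torch
from torch.fx import symbolic_trace

def horizontal_fusion_linear(model: torch.nn.Module) -> torch.fx.GraphModule:
    # Trace the model into a torch.fx graph (body paraphrased; the real
    # pass also fuses parallel nn.Linear layers here).
    fx_model = symbolic_trace(model)
    # Original: return fx_model
    # The experiment: move the whole traced graph to the 'npu' device the
    # library registers. At inference time this fails with:
    #   AttributeError: 'Tensor' object has no attribute 'is_contiguous'
    return fx_model.to('npu')
```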