[SYCL] Fix batched impl for NVidia GPU #6164

Merged: 3 commits merged into ggerganov:master on Mar 27, 2024

Conversation

@AidanBeltonS (Collaborator) commented Mar 19, 2024

This PR fixes the failing batched_gemm call. Open-source oneMKL does not currently support mixing halves and floats in the same call.
This change uses halves throughout the computation, which resolves the failing tests in the Nvidia build.

I am currently adding support for these dtype combinations to oneMKL, after which this change can be reverted.

Tested on A100 and PVC
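
For context only, here is a minimal sketch of the kind of all-halves batched GEMM call described above, assuming oneMKL's strided USM gemm_batch overload; the helper name, layouts, and strides are illustrative and not the actual code from this PR:

```cpp
#include <cstdint>
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>

// Illustrative helper (not the PR's actual code): batched C = A * B using
// sycl::half for A, B and C alike, instead of half inputs with a float
// output, which the open-source oneMKL CUDA backend does not support yet.
void batched_gemm_all_half(sycl::queue &q,
                           const sycl::half *a, const sycl::half *b, sycl::half *c,
                           std::int64_t m, std::int64_t n, std::int64_t k,
                           std::int64_t batch) {
    const sycl::half alpha = 1.0f;
    const sycl::half beta  = 0.0f;
    oneapi::mkl::blas::column_major::gemm_batch(
        q,
        oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
        m, n, k,
        alpha,
        a, /*lda=*/m, /*stride_a=*/m * k,
        b, /*ldb=*/k, /*stride_b=*/k * n,
        beta,
        c, /*ldc=*/m, /*stride_c=*/m * n,
        batch);
}
```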

@OuadiElfarouki (Contributor) left a comment

Thanks for the fix. LGTM!

@NeoZhangJianyu (Collaborator) commented Mar 20, 2024

> This PR fixes the failing batched_gemm call. Open-source oneMKL does not currently support mixing halves and floats in the same call. This change uses halves throughout the computation, which resolves the failing tests in the Nvidia build.
>
> I am currently adding support for these dtype combinations to oneMKL, after which this change can be reverted.
>
> Tested on A100 and PVC

This code reverts my change from #5895.
My PR was meant to increase performance on Intel GPU, because fp32 is better than fp16 according to the test results.
It also made the CI/unit tests pass - a smaller NMS (I can't remember clearly).

I think this PR will reduce performance on Intel GPU.

  1. Please verify the performance with llama-bench on Intel Arc 770/Flex 170 and the MTL iGPU.
  2. Please run the CI/unit tests locally and make sure they pass 100% on both GPUs.

@NeoZhangJianyu (Collaborator) commented:

The PR uses mixed fp32 and fp16 to increase the performance of fp32 mode.
If the user accepts fp16, why not use fp16 mode directly? It has better performance.

It will reduce accuracy because fp16 is involved.
I don't think it's a good solution for improving fp32 mode performance.

@AidanBeltonS (Collaborator, Author) commented:

> The PR uses mixed fp32 and fp16 to increase the performance of fp32 mode. If the user accepts fp16, why not use fp16 mode directly? It has better performance.
>
> It will reduce accuracy because fp16 is involved. I don't think it's a good solution for improving fp32 mode performance.

Yes, this will decrease performance for iGPUs. However, oneMKL does not currently support mixed data types for gemm_batch on the Nvidia and AMD targets, so right now the call asserts with an unsupported-combination error. I am working on adding mixed data type support and hope to have a long-term fix upstreamed in oneMKL.

I propose taking the less optimal but more generic path to temporarily maintain support for all targets and keep the relevant tests passing. Once oneMKL is updated, we can revert this change and get back the benefits you are describing.

@NeoZhangJianyu (Collaborator) commented:

> The PR uses mixed fp32 and fp16 to increase the performance of fp32 mode. If the user accepts fp16, why not use fp16 mode directly? It has better performance.
> It will reduce accuracy because fp16 is involved. I don't think it's a good solution for improving fp32 mode performance.
>
> Yes, this will decrease performance for iGPUs. However, oneMKL does not currently support mixed data types for gemm_batch on the Nvidia and AMD targets, so right now the call asserts with an unsupported-combination error. I am working on adding mixed data type support and hope to have a long-term fix upstreamed in oneMKL.
>
> I propose taking the less optimal but more generic path to temporarily maintain support for all targets and keep the relevant tests passing. Once oneMKL is updated, we can revert this change and get back the benefits you are describing.

Great!
Currently, Intel iGPU is the main target of the SYCL backend,
so we have to pay attention to the impact on iGPU.

@NeoZhangJianyu (Collaborator) commented Mar 21, 2024

I think we should establish a clear framework/rule to handle this case: different optimizations for different hardware architectures.

Here is my proposal:

  1. Use #ifdef xxx
    Static build: good performance and a clean running code path for troubleshooting.
    Only supports one vendor's devices at a time.
  2. Use if (hardware is xxx)
    Dynamic switch: a small performance decrease and more code paths to check when troubleshooting.
    Supports devices from multiple vendors at the same time.

Regarding alternative 2:
currently the build/cmake only supports building SYCL for one vendor's devices, so alternative 2 can't actually enable the mixed-vendor feature.
So it looks like only alternative 1 is available; a rough sketch of it follows below.
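
For illustration only, a minimal sketch of what alternative 1 could look like; GGML_SYCL_TARGET_NVIDIA is a hypothetical macro name chosen here, not one taken from the codebase:

```cpp
// Alternative 1: pick the code path at build time.
// GGML_SYCL_TARGET_NVIDIA is a hypothetical macro that the build system
// (e.g. a cmake option) would define when targeting Nvidia devices.
#ifdef GGML_SYCL_TARGET_NVIDIA
// all-half gemm_batch: the only combination the open-source oneMKL
// CUDA backend currently supports
constexpr bool use_all_half_gemm_batch = true;
#else
// mixed fp16/fp32 gemm_batch: faster on Intel GPUs according to #5895
constexpr bool use_all_half_gemm_batch = false;
#endif
```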

Invite @ggerganov @slaren

@slaren (Collaborator) commented Mar 21, 2024

Just some thoughts: my long-term goal is to move backends to dynamic libraries that can be loaded at runtime. Then it will be possible to use multiple backends simultaneously, for example CUDA for an NVIDIA device and SYCL for an Intel device at the same time. So there is no need for the SYCL backend to support every vendor if there is already a backend for that vendor that may provide better performance. It would of course be good for the SYCL backend to work with every device just to have more options, but the effort may be better spent improving performance on Intel devices.

@NeoZhangJianyu (Collaborator) commented:

> Just some thoughts: my long-term goal is to move backends to dynamic libraries that can be loaded at runtime. Then it will be possible to use multiple backends simultaneously, for example CUDA for an NVIDIA device and SYCL for an Intel device at the same time. So there is no need for the SYCL backend to support every vendor if there is already a backend for that vendor that may provide better performance. It would of course be good for the SYCL backend to work with every device just to have more options, but the effort may be better spent improving performance on Intel devices.

Yes, I think the long-term idea is good.
So alternative 1 is the right choice.

@AidanBeltonS, maybe you could use alternative 1 to handle your code for NV/AMD.

@abhilash1910 (Collaborator) commented:

I think simple #ifdef macro branching can be quite useful here, keeping the rest of the code performant on iGPU.

@NeoZhangJianyu (Collaborator) commented:

Mixing fp32 and fp16 still has an accuracy loss issue.
Is there any test to prove there is no impact on the result?

@AidanBeltonS (Collaborator, Author) commented:

I've guarded the previous approach with a runtime backend check. I think this is cleaner than using a macro, and checking the backend is a cheap operation, so there should be no noticeable drop in performance.

It has been checked on Intel and NVidia GPUs.
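
Roughly, such a runtime guard could look like the sketch below; the helper name is illustrative and the actual check in the PR may differ (it assumes DPC++'s ext_oneapi_cuda backend enumerator):

```cpp
#include <sycl/sycl.hpp>

// Illustrative helper (not the PR's actual code): decide at runtime whether
// the queue targets the CUDA backend, and take the all-half gemm_batch path
// there while keeping the mixed fp16/fp32 path for other backends.
inline bool use_all_half_gemm_batch(const sycl::queue &q) {
    return q.get_device().get_backend() == sycl::backend::ext_oneapi_cuda;
}
```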

@AidanBeltonS (Collaborator, Author) commented:

> Mixing fp32 and fp16 still has an accuracy loss issue. Is there any test to prove there is no impact on the result?

All tests are passing with this approach, so any possible precision issues do not appear to be significant.

@abhilash1910 (Collaborator) commented:

This can be merged. The CI failure is not related to SYCL. cc @ggerganov

@abhilash1910 merged commit e82f9e2 into ggerganov:master on Mar 27, 2024.
55 of 56 checks passed.
hodlen pushed commits to hodlen/llama.cpp that referenced this pull request on Apr 1 and Apr 3, 2024, and tybalex pushed a commit to tybalex/function.cpp that referenced it on Apr 17, 2024, all with the commit message:

* Fix batched impl
* Maintain previous behaviour for igpu
* retrigger CI

Co-authored-by: Abhilash Majumder <[email protected]>