Add support for MLprogram in ort_coreml #116

Merged · 8 commits · Nov 29, 2024

Conversation

@yuygfgg (Contributor) commented on Nov 5, 2024

ONNX Runtime supports two Core ML execution provider formats: NeuralNetwork and MLProgram. NeuralNetwork is the default choice because it supports a wider range of operators, but it does not support FP16 precision (so all nodes fall back to the CPUExecutionProvider).

The MLProgram provider, while newer and currently supporting fewer operators, does support FP16 and is under active development (recent GitHub PRs suggest it will mature rapidly, adding dozens of new operators). Although it may be slower for now due to limited operator coverage, once coverage becomes comprehensive, the potential CPU/GPU acceleration from FP16 could make it outperform the NeuralNetwork provider.

In ONNX Runtime, the ONNX model is converted to a Core ML model and saved to disk, which is then loaded via Apple's Core ML framework. Choosing FP16 inputs with the MLProgram provider significantly reduces both memory and disk usage, since the Core ML model is stored in the more compact FP16 format. The ANE always computes in FP16 internally regardless of input precision, so FP16 brings no extra acceleration on the neural engine itself, but the storage benefits remain valuable.

Moreover, FP16 inputs may accelerate computation on the GPU and CPU, as both support FP16 (though it is not enabled by default). However, the exact FP16 behavior in ONNX Runtime remains unclear because of its layered execution flow: ORT first decides which nodes to assign to CoreML and runs the rest on the CPUExecutionProvider, and CoreML then further distributes its nodes among CPU, GPU, and ANE.
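For reference, the sketch below shows how the same NeuralNetwork-vs-MLProgram choice is expressed when using onnxruntime directly from Python. This is illustrative only and is not the code path the ORT_COREML backend uses; the provider-option names (ModelFormat, MLComputeUnits) are assumptions based on recent CoreML EP documentation, so check the docs for your onnxruntime version.

import onnxruntime as ort

# Illustrative sketch, not vsmlrt's actual code path.
# Option names are assumptions from recent CoreML EP docs; needs onnxruntime >= 1.20.
sess = ort.InferenceSession(
    "model.onnx",  # hypothetical model path
    providers=[
        ("CoreMLExecutionProvider", {
            "ModelFormat": "MLProgram",            # or "NeuralNetwork"
            "MLComputeUnits": "CPUAndNeuralEngine",
        }),
        "CPUExecutionProvider",  # fallback for nodes CoreML cannot take
    ],
)
print(sess.get_providers())  # which EPs were actually registered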


For more details on FP16 behavior, refer to this documentation: 16-bit precision in Core ML on ANE.

Relevant ML Program PRs: microsoft/onnxruntime#19347, microsoft/onnxruntime#22068, microsoft/onnxruntime#22480, microsoft/onnxruntime#22710, among others.

This PR enables FP16 computation on the ANE instead of pushing everything to the CPU. However, MLProgram is not yet well supported and covers far fewer operators than the regular NeuralNetwork format.
@yuygfgg (Contributor, Author) commented on Nov 5, 2024

Requires onnxruntime >= 1.20.0.

@WolframRhodium (Contributor) commented:

Interesting and thanks for the information.

@WolframRhodium (Contributor) commented on Nov 6, 2024

Please use snake_case and place the ml_program param at the end of the param list.

@yuygfgg (Contributor, Author) commented on Nov 28, 2024

Tested on an M2 Pro:

ml_program=1 + fp16=True: ANE at 120% usage, 10.52 fps
ml_program=1 + fp16=False: GPU at 97% usage, 5.40 fps
ml_program=0 + fp16=True: CPU at 100% usage, too slow, did not wait for it to finish
ml_program=0 + fp16=False: ANE at 114% usage, 10.23 fps

Script:

import vapoursynth as vs
from vapoursynth import core
import vsmlrt

# Performance is the same with format=vs.RGBH.
src = core.lsmas.LWLibavSource('/path/to/source').resize.Spline36(
    1920 // 2, 1080 // 2, format=vs.RGBS, matrix_in_s="709"
)
# anime_style_art_rgb uses the simplest ops, almost all of which are supported by the tested backends.
fin = vsmlrt.Waifu2x(
    clip=src, noise=-1, scale=2,
    backend=vsmlrt.Backend.ORT_COREML(ml_program=1, fp16=True),
    model=vsmlrt.Waifu2xModel.anime_style_art_rgb,
)

fin.set_output(0)
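For context on the fps figures above: they would typically come from something like vspipe's progress output. The sketch below is a rough in-Python alternative; it assumes the script above has already run in the same session so that fin is defined.

import time

# Rough throughput check (sketch): request every frame of `fin` and time it.
start = time.time()
count = 0
for _ in fin.frames():
    count += 1
elapsed = time.time() - start
print(f"{count} frames in {elapsed:.1f} s -> {count / elapsed:.2f} fps")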

@yuygfgg marked this pull request as ready for review on November 28, 2024 at 13:07.
@WolframRhodium merged commit a2b1a88 into AmusementClub:master on Nov 29, 2024.
@WolframRhodium (Contributor) commented:

Thanks for your contribution!
