📤 Add export task (coreml and tflite) #174


Open
wants to merge 17 commits into main
Conversation

@ramonhollands (Contributor) commented Feb 20, 2025

This pull request adds a new export task, including the option to export to CoreML and TFLite formats.

Use:

python yolo/lazy.py task=export name=ExportCoreml model=v9-s task.format=coreml
python yolo/lazy.py task=export name=ExportTflite model=v9-s task.format=tflite

In addition, it adds the option to use the FastModelLoader again during inference.

python yolo/lazy.py task=inference name=TfliteInference device=cpu model=v9-s task.nms.min_confidence=0.1 task.fast_inference=tflite use_wandb=False task.data.source=demo/images/test.jpg

TFLite export depends on ai_edge_torch, which requires Python 3.10.
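
For reference, a minimal TFLite export via ai_edge_torch usually looks roughly like the sketch below; the stand-in module and the 640x640 input shape are placeholders, not necessarily what this PR uses.

import torch
import ai_edge_torch

# Stand-in module; in practice this would be the YOLO model created by this repo
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1)).eval()
sample_input = (torch.rand(1, 3, 640, 640),)

# ai_edge_torch.convert traces the module and produces a TFLite flatbuffer
edge_model = ai_edge_torch.convert(model, sample_input)
edge_model.export("model.tflite")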

Next steps would be to add quantization and auto-install missing modules.

@ramonhollands changed the title from "📤 Add export task" to "📤 Add export task (coreml and tflite)" on Feb 20, 2025
@henrytsui000 (Member)

Hi,
I noticed a change in yolo.py that converts the enumeration to an explicit counter (idx). We have since updated our forward function with two actions: shortcut (directly obtaining an output from a middle layer) and external (feeding the model with tensors from other external sources).

Can you check whether it still runs with this modification?

Henry Tsui

@ramonhollands (Contributor, Author)

Hi Henry,
I merged the latest main branch into the 'add-export-task' branch and can confirm it still works correctly.
Best regards,
Ramon

@pzoltowski

I tried this PR. I think there is an error in:

if self.format == "coreml":
    export_mode = True

should be:

if format == "coreml":
    export_mode = True

Also, did you manage to get good performance using ct.ComputeUnit.CPU_AND_NE in the Xcode benchmark? For me, the model exported to CoreML runs ~15x slower than comparable CoreML models from HF or Ultralytics, and around 10x slower than the same model exported to ONNX and run with the CoreML execution provider. Somehow it doesn't want to execute any operation on the ANE. I tried many different tweaks and settings but had no luck.
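
For context, compute-unit selection happens at conversion (or load) time in coremltools, roughly as in the sketch below; the traced placeholder model, input name, and deployment target are illustrative assumptions, not the exact settings in this PR. Ops the ANE cannot handle fall back to CPU/GPU, which is why the Xcode performance report is the place to see where each layer actually runs.

import torch
import coremltools as ct

# Placeholder traced model; the PR converts the exported YOLO model instead
traced = torch.jit.trace(torch.nn.Conv2d(3, 16, 3).eval(), torch.rand(1, 3, 640, 640))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="image", shape=(1, 3, 640, 640))],
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # restrict scheduling to CPU + Neural Engine
    minimum_deployment_target=ct.target.iOS16,
)
mlmodel.save("model.mlpackage")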

@ramonhollands (Contributor, Author)

Yeah, you are right about the change.

I'm still looking into the slowness. When skipping the export layers guarded by 'if self.export_mode:', the model runs on the ANE and is super fast.

If you can help me with debugging, that would be great.

E.g., what code do HF or Ultralytics use for those decoding layers?

        if self.export_mode:

            # Flatten each scale's (B, C, H, W) predictions to (B, H*W, C) and collect them
            preds_cls, preds_anc, preds_box = [], [], []
            for layer_output in output["Main"]:
                pred_cls, pred_anc, pred_box = layer_output
                preds_cls.append(pred_cls.permute(0, 2, 3, 1).reshape(pred_cls.shape[0], -1, pred_cls.shape[1]))
                preds_anc.append(
                    pred_anc.permute(0, 3, 4, 1, 2).reshape(pred_anc.shape[0], -1, pred_anc.shape[2], pred_anc.shape[1])
                )
                preds_box.append(pred_box.permute(0, 2, 3, 1).reshape(pred_box.shape[0], -1, pred_box.shape[1]))

            # Concatenate across scales along the anchor dimension
            preds_cls = torch.concat(preds_cls, dim=1).to(x[0][0].device)
            preds_anc = torch.concat(preds_anc, dim=1).to(x[0][0].device)
            preds_box = torch.concat(preds_box, dim=1).to(x[0][0].device)

            # Decode LTRB offsets into absolute box corners using the generated anchor grid
            strides = self.get_strides(output["Main"], input_width)
            anchor_grid, scaler = self.generate_anchors([input_width, input_height], strides)
            anchor_grid = anchor_grid.to(x[0][0].device)
            scaler = scaler.to(x[0][0].device)
            pred_LTRB = preds_box * scaler.view(1, -1, 1)
            lt, rb = pred_LTRB.chunk(2, dim=-1)
            preds_box = torch.cat([anchor_grid - lt, anchor_grid + rb], dim=-1)

            return preds_cls, preds_anc, preds_box

@pzoltowski

Thanks, I can confirm that after switching to export_mode = False and removing outputs=outputs from ct.convert(), inference is very fast and runs on the ANE. I also tried switching to the tracing method (exported_program = torch.jit.trace(self.model, example_input, strict=False)) instead of exported_program = torch.export.export(self.model, (example_input,)), but the MIL compiler fails in that mode.
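
For reference, the two conversion paths being compared look roughly like this; the Conv2d stand-in is a placeholder, and coremltools' support for torch.export ExportedProgram inputs is newer than its TorchScript path, so behaviour can differ between the two.

import torch
import coremltools as ct

model = torch.nn.Conv2d(3, 16, 3).eval()          # placeholder for the YOLO model
example_input = torch.rand(1, 3, 640, 640)

# Path A: TorchScript tracing, the long-standing coremltools input format
traced = torch.jit.trace(model, example_input, strict=False)
ml_a = ct.convert(traced, inputs=[ct.TensorType(shape=example_input.shape)])

# Path B: torch.export ExportedProgram, as used in this PR
exported_program = torch.export.export(model, (example_input,))
ml_b = ct.convert(exported_program)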

I'd like to help, but I'm afraid I'm not skilled enough to know how it all works. I guess this is the final layer doing something similar to NMS, and if it is done outside the ML model (post-inference), the whole pipeline would be slow as well?

I only found this repo and issue that might be helpful (including the comments and the pocketpixels/yolov5 repo linked there). Not sure, though, how different the architecture is between YOLOv5 and YOLOv9.

https://gitlab.com/ultralytics/yolov5/-/merge_requests/7263

@pzoltowski

I asked Gemini 2.5 Pro about this. I cannot verify it, but it suggested that the graph is not static and proposed precomputing the anchors outside the model and passing them in as input:

3. Why it Fails on ANE / Becomes Slow

Dynamic Shapes & Operations: The core problem is the dynamic calculation of anchors (generate_anchors) and the dependency of tensor shapes (like the size N in preds_cls, preds_box) on the input image dimensions (input_width, input_height). ANE requires static computation graphs with fixed tensor shapes known at compile time (when converting to Core ML). Operations like torch.arange, torch.meshgrid, and shape calculations based on x.shape within the forward pass make the graph dynamic and prevent ANE execution.

Unsupported Ops: While basic operations are often supported, the specific combination or certain dynamic operations might trigger fallbacks to the CPU or GPU, negating ANE benefits. The anchor generation is the most likely culprit.
Missing NMS: Even if this decoding could run on ANE, it's incomplete. You still need NMS, which is computationally intensive. If NMS is done outside the model on the CPU afterwards, it remains a bottleneck.

4. Solution: Leverage Core ML's Native Capabilities
The goal is to have the entire pipeline (inference + decoding + NMS) run efficiently, ideally using ANE. The standard approach for Apple devices, as hinted by the YOLOv5 CoreML example, is:

Export a "Simpler" PyTorch Model: Export a version of the model that outputs raw or slightly processed predictions, but crucially, without the dynamic anchor generation and decoding logic inside forward. The computation graph must be static.

Convert to Core ML: Convert this simplified PyTorch model to a basic Core ML model.
Add Decoding and NMS Layers within Core ML: Use coremltools to modify the Core ML model's specification (.mlmodel file) by adding native Core ML layers to perform the decoding and NMS. Core ML has built-in, optimized layers for NMS.

Proposed Steps:
Define Fixed Export Size: Choose a fixed input size (e.g., [640, 640]) for which you will export the model. ANE works best with fixed sizes.
Precompute Anchors: For this fixed size, precompute the anchor_grid and scaler tensors outside the forward pass. Treat them as constants.
Create an Export Wrapper Module: This module will contain the decoding logic, but using the precomputed constants.

The full LLM markdown output is here:
gemni2.5pro_yolov9_slow_on_ane_output_postprocessing.md.txt
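
To make the "export wrapper" suggestion concrete, a minimal sketch could look like the following. ExportWrapper is a hypothetical name, and it assumes the model already returns the (preds_cls, preds_anc, preds_box) tuple shown earlier in this thread; the real decoding lives in the repo's yolo.py.

import torch

class ExportWrapper(torch.nn.Module):
    """Illustrative wrapper: anchor_grid and scaler are baked in as constant buffers,
    so the exported graph stays static (no arange/meshgrid at inference time)."""

    def __init__(self, model, anchor_grid, scaler):
        super().__init__()
        self.model = model
        # Buffers are exported as constants instead of being recomputed per forward pass
        self.register_buffer("anchor_grid", anchor_grid)
        self.register_buffer("scaler", scaler)

    def forward(self, x):
        preds_cls, _, preds_box = self.model(x)              # raw head outputs
        pred_ltrb = preds_box * self.scaler.view(1, -1, 1)   # scale LTRB offsets to pixels
        lt, rb = pred_ltrb.chunk(2, dim=-1)
        boxes = torch.cat([self.anchor_grid - lt, self.anchor_grid + rb], dim=-1)
        return preds_cls, boxes

Here anchor_grid and scaler would be computed once for the fixed export size (for example with the existing generate_anchors) before conversion, and NMS could then be appended with Core ML's built-in NMS support, as the linked YOLOv5 discussion describes.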

@pzoltowski commented Apr 20, 2025

EDIT: My bad, this is irrelevant. I benchmarked the wrong model; the change below doesn't fix it, it is still slow.

I did another test, commenting out some lines in yolo.py:

            preds_cls = torch.concat(preds_cls, dim=1).to(x[0][0].device)
            preds_anc = torch.concat(preds_anc, dim=1).to(x[0][0].device)
            preds_box = torch.concat(preds_box, dim=1).to(x[0][0].device)

            strides = self.get_strides(output["Main"], input_width)
            
            # anchor_grid, scaler = self.generate_anchors([input_width, input_height], strides)  #
            # anchor_grid = anchor_grid.to(x[0][0].device)
            # scaler = scaler.to(x[0][0].device)
            # pred_LTRB = preds_box * scaler.view(1, -1, 1)
            # lt, rb = pred_LTRB.chunk(2, dim=-1)
            # preds_box = torch.cat([anchor_grid - lt, anchor_grid + rb], dim=-1)

            return preds_cls, preds_anc, preds_box

and this still runs very fast, so Gemini is probably right that the bottleneck is:
anchor_grid, scaler = self.generate_anchors([input_width, input_height], strides)
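
For context, anchor generation in anchor-free YOLO heads typically looks something like the simplified sketch below (not the repo's exact generate_anchors); the arange/meshgrid calls whose sizes depend on the input shape are exactly the kind of dynamic ops that keep the converted graph off the ANE, and precomputing the result for a fixed export size avoids them.

import torch

def generate_anchors_sketch(image_size, strides):
    # image_size = [W, H]; builds one grid of (x, y) cell centers per detection scale
    anchors, scalers = [], []
    for stride in strides:
        w, h = image_size[0] // stride, image_size[1] // stride
        shift = stride // 2
        xs = torch.arange(0, w) * stride + shift          # length depends on input width...
        ys = torch.arange(0, h) * stride + shift          # ...and height
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")    # grid shape is input-dependent too
        anchors.append(torch.stack([gx, gy], dim=-1).reshape(-1, 2).float())
        scalers.append(torch.full((w * h,), float(stride)))
    return torch.cat(anchors, dim=0), torch.cat(scalers, dim=0)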

@ramonhollands (Contributor, Author)

I tried exporting with different combinations as well, and even made things 'static' by using

B, C, H, W = pred_cls.shape
pred_cls = pred_cls.contiguous().view(B, C, H * W).transpose(1, 2)

Instead of

preds_cls.append(pred_cls.permute(0, 2, 3, 1).reshape(pred_cls.shape[0], -1, pred_cls.shape[1]))

Besides that, pred_anc is not used anywhere in the post-processing code, so we can skip it.

No luck yet, but I'm currently busy with other projects, so I will have another look in a couple of weeks.
