Hi, thanks for the great work. I have a couple of questions.
Q1) I was able to check inference speed for SOTA models, but I couldn't find quantitative accuracy results (e.g., mAP or similar task metrics) for 8-bit quantized models beyond MobileNet in the tutorial (https://app.aihub.qualcomm.com/docs/hub/inference_examples.html). The export code reports PSNR between the 32-bit and 8-bit output logits, but I'm more interested in task-specific metrics such as mAP. Is there a resource where I can find such results for quantized SOTA models? Results for models incorporating Multi-Head Attention (MHA) would be especially helpful.
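For anyone reproducing the 32-bit vs. 8-bit logit comparison locally, PSNR is straightforward to compute. This is a minimal NumPy sketch of the standard metric, not the Hub's actual export code; the peak value and the example logits here are illustrative assumptions:

```python
import numpy as np

def psnr(fp32_logits: np.ndarray, int8_logits: np.ndarray) -> float:
    """Peak signal-to-noise ratio (dB) between two logit tensors.

    Uses the max absolute fp32 logit as the peak value; higher is
    better, and identical tensors give +inf.
    """
    mse = np.mean((fp32_logits - int8_logits) ** 2)
    if mse == 0:
        return float("inf")
    peak = np.max(np.abs(fp32_logits))
    return 20.0 * np.log10(peak) - 10.0 * np.log10(mse)

# Hypothetical logits: small perturbation stands in for quantization noise.
fp32 = np.array([2.0, -1.5, 0.3, 4.1])
int8_dequant = fp32 + np.random.default_rng(0).normal(0.0, 0.05, fp32.shape)
print(f"PSNR: {psnr(fp32, int8_dequant):.1f} dB")
```

As the question notes, a high logit PSNR does not guarantee a small drop in a task metric like mAP, since small logit shifts near decision boundaries can still flip predictions.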
Q2) I'm working on a custom multi-task model with MHA modules, and I'm seeing significant accuracy drops when quantizing from ONNX to QNN with the QNN SDKs. I understand some degradation is expected, but none of the techniques I've tried (e.g., AIMET AdaRound, QAT, PTQ with various schemes) has been effective for my model.
It seems that for simpler models like ViT, PTQ on the Hub is sufficient, whereas pre-compiled models like the quantized Stable Diffusion v1.5 appear to have had additional steps applied locally before being uploaded. Could you clarify which techniques or processes were applied during that local pre-compilation stage to achieve better 8-bit accuracy?
ViT:
model init -> compile on Hub (where PTQ is applied) -> profile
SD v1.5 quantized:
model init -> upload a pre-compiled model to Hub -> profile
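One thing that may be worth ruling out locally before deeper techniques like AdaRound: attention projection weights often contain a few outlier channels, and a single per-tensor int8 scale wastes most of the quantization range on them. The NumPy sketch below is an illustration of that effect only (it is not AIMET's or the Hub's implementation), comparing per-tensor against per-channel symmetric fake-quantization error on a weight matrix with one injected outlier:

```python
import numpy as np

def fake_quant(x: np.ndarray, axis=None) -> np.ndarray:
    """Symmetric int8 fake-quantization.

    axis=None -> one scale for the whole tensor (per-tensor);
    axis=1    -> one scale per output row (per-channel).
    """
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 64))  # stand-in projection weights
w[0, 0] = 1.0  # one outlier, as sometimes seen in attention projections

per_tensor_err = np.mean((w - fake_quant(w)) ** 2)
per_channel_err = np.mean((w - fake_quant(w, axis=1)) ** 2)
print(f"per-tensor MSE:  {per_tensor_err:.2e}")
print(f"per-channel MSE: {per_channel_err:.2e}")
```

If per-channel weight quantization (or keeping the worst offending ops, e.g. softmax inputs, at higher precision) closes most of the gap, the accuracy drop is likely range-related rather than something AdaRound or QAT alone can fix.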
Thanks!