Hi, thanks for the great work. I have a couple of questions.
Q1) I was able to check inference speed for SOTA models, but I couldn't find quantitative accuracy results (e.g., mAP or similar task metrics) for 8-bit quantized models beyond MobileNet in the tutorial (https://app.aihub.qualcomm.com/docs/hub/inference_examples.html). The export code reports PSNR between the 32-bit and 8-bit output logits, but I'm more interested in task-specific metrics such as mAP. Is there a resource where I can find such results for quantized SOTA models? Results for models incorporating Multi-Head Attention (MHA) would be especially helpful.
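For anyone reproducing the 32-bit vs. 8-bit logit comparison locally, PSNR is straightforward to compute. This is a minimal NumPy sketch of the standard metric, not the Hub's actual export code; the peak value and the example logits here are illustrative assumptions:

```python
import numpy as np

def psnr(fp32_logits: np.ndarray, int8_logits: np.ndarray) -> float:
    """Peak signal-to-noise ratio (dB) between two logit tensors.

    Uses the max absolute fp32 logit as the peak value; higher is
    better, and identical tensors give +inf.
    """
    mse = np.mean((fp32_logits - int8_logits) ** 2)
    if mse == 0:
        return float("inf")
    peak = np.max(np.abs(fp32_logits))
    return 20.0 * np.log10(peak) - 10.0 * np.log10(mse)

# Hypothetical logits: small perturbation stands in for quantization noise.
fp32 = np.array([2.0, -1.5, 0.3, 4.1])
int8_dequant = fp32 + np.random.default_rng(0).normal(0.0, 0.05, fp32.shape)
print(f"PSNR: {psnr(fp32, int8_dequant):.1f} dB")
```

As the question notes, a high logit PSNR does not guarantee a small drop in a task metric like mAP, since small logit shifts near decision boundaries can still flip predictions.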
Q2) I'm working on a custom multi-task model with MHA modules, and I'm seeing significant accuracy drops when quantizing from ONNX to QNN with the QNN SDKs. I understand some degradation is expected, but none of the techniques I've tried (e.g., AIMET AdaRound, QAT, PTQ with various schemes) has been effective for my model.
It seems that for simpler models like ViT, PTQ on the Hub is sufficient, whereas pre-compiled models like the quantized Stable Diffusion v1.5 appear to have had additional steps applied locally before being uploaded. Could you clarify which techniques or processes were applied during that local pre-compilation stage to achieve better 8-bit accuracy?
ViT:
model init -> compile on Hub (where PTQ is applied) -> profile
SD v1.5 quantized:
model init -> upload a pre-compiled model to Hub -> profile
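One thing that may be worth ruling out locally before deeper techniques like AdaRound: attention projection weights often contain a few outlier channels, and a single per-tensor int8 scale wastes most of the quantization range on them. The NumPy sketch below is an illustration of that effect only (it is not AIMET's or the Hub's implementation), comparing per-tensor against per-channel symmetric fake-quantization error on a weight matrix with one injected outlier:

```python
import numpy as np

def fake_quant(x: np.ndarray, axis=None) -> np.ndarray:
    """Symmetric int8 fake-quantization.

    axis=None -> one scale for the whole tensor (per-tensor);
    axis=1    -> one scale per output row (per-channel).
    """
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 64))  # stand-in projection weights
w[0, 0] = 1.0  # one outlier, as sometimes seen in attention projections

per_tensor_err = np.mean((w - fake_quant(w)) ** 2)
per_channel_err = np.mean((w - fake_quant(w, axis=1)) ** 2)
print(f"per-tensor MSE:  {per_tensor_err:.2e}")
print(f"per-channel MSE: {per_channel_err:.2e}")
```

If per-channel weight quantization (or keeping the worst offending ops, e.g. softmax inputs, at higher precision) closes most of the gap, the accuracy drop is likely range-related rather than something AdaRound or QAT alone can fix.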
Thanks!