Dimension mismatch error inside the model when running the paraformer large ONNX model on GPU #1821

zhu-gu-an · 2024-06-17T02:17:49Z

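When I run the paraformer large ONNX model on GPU, a dimension mismatch error occurs inside the model.
The model I am using: https://modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-onnx/files
The model is not quantized.
Error:

What is this related to? How can I best fix it, or at least track down the cause?

Additional note: running on CPU works without problems; on GPU the error occurs intermittently.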

poor1017 · 2024-06-17T02:49:06Z

Could this be related to batching?

willnufe · 2024-06-17T03:16:10Z

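1. Problem

Most likely the problem is in the CIF part (CIF is implemented with a for loop):

When exporting to ONNX, what we want is for the loop's iteration count to be dynamic, i.e. to adapt to dynamic input lengths;
In practice, however, the export fixes the iteration count to whatever the CIF length was for the sample input used during export, for example 24;
The decoder input actually consists of two parts: the encoder output plus the predictor (CIF) output;
The encoder output is not affected (that should be the (1, 127, 512) tensor in your error), but the predictor output, constrained by the fixed CIF length, becomes (1, 24, 1). The error only surfaces when these two tensors go through a mul operation in the decoder;
So the problem is actually in the predictor, not in the decoder.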
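
To make the shape argument concrete, here is a minimal sketch in plain Python of NumPy-style broadcasting, which is the rule the ONNX Mul operator follows. The helper `broadcast_shape` is a hypothetical name written only for this illustration, and the compatible shape `(1, 127, 1)` is an assumed example rather than something taken from the model.

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape of a and b, or raise ValueError."""
    result = []
    # Align shapes from the trailing dimension; missing leading dims count as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"cannot broadcast {a} with {b}: {da} vs {db}")
        result.append(max(da, db))
    return tuple(reversed(result))


encoder_out = (1, 127, 512)  # shape quoted in the comment above
frozen_pred = (1, 24, 1)     # length frozen at the export sample's value

print(broadcast_shape(encoder_out, (1, 127, 1)))  # assumed compatible shape
try:
    broadcast_shape(encoder_out, frozen_pred)
except ValueError as err:
    print("Mul would fail:", err)
```

The second call fails because 127 and 24 differ and neither is 1, which matches the kind of mismatch described above.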

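2. Solution

You can open the model in netron and check whether the predictor part has a fixed length;
Replace the CIF part of the predictor with a parallel implementation (search for "parallel cif");
Alternatively, I have seen that ONNX export can support for loops with dynamic dimensions, but I have not tried it myself; you could give that a try as well.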
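
As a rough illustration of what a loop-free CIF can look like, the sketch below compares a frame-by-frame CIF with a cumulative-sum formulation in plain Python. This is an assumption about the general idea behind "parallel cif", not FunASR's actual implementation: the function names are hypothetical, hidden states are scalars for brevity, and every weight is assumed to be at most the threshold so that a frame fires at most once.

```python
import math


def cif_sequential(alphas, hidden, threshold=1.0):
    """Loop-carried CIF: integrate weights frame by frame, fire at threshold."""
    tokens, integrate, frame = [], 0.0, 0.0
    for alpha, h in zip(alphas, hidden):
        needed = threshold - integrate    # weight still missing to fire
        integrate += alpha
        if integrate >= threshold:        # fire: close the current token
            tokens.append(frame + needed * h)
            integrate -= threshold
            frame = (alpha - needed) * h  # leftover weight starts next token
        else:
            frame += alpha * h
    return tokens


def cif_cumsum(alphas, hidden, threshold=1.0):
    """Token k owns the cumulative-weight interval [k*threshold, (k+1)*threshold];
    each frame contributes its overlap with that interval."""
    ends, total = [], 0.0
    for alpha in alphas:
        total += alpha
        ends.append(total)
    starts = [0.0] + ends[:-1]
    n_tokens = math.floor(total / threshold)
    return [
        sum(max(0.0, min(e, (k + 1) * threshold) - max(s, k * threshold)) * h
            for s, e, h in zip(starts, ends, hidden))
        for k in range(n_tokens)
    ]


alphas = [0.5, 0.75, 0.25, 0.5, 1.0, 0.5]
hidden = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cif_sequential(alphas, hidden))  # [1.5, 3.25, 5.0]
print(cif_cumsum(alphas, hidden))      # [1.5, 3.25, 5.0]
```

In the cumulative-sum version each (frame, token) weight depends only on the running sums, so in a tensor framework it can be written with cumsum, min, max and a matrix product, with no data-dependent loop for the exporter to freeze.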

LauraGPT · 2024-06-17T06:26:45Z

You could wait two days until we release our GPU deployment.

zhu-gu-an · 2024-06-19T08:12:24Z

Will the release be a C++ GPU deployment solution?

zhu-gu-an · 2024-06-19T08:42:09Z

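On GPU the batch size is always 1. Sometimes the GPU gets through the whole test set; then, after looping over it a few times, the error shows up in one particular pass. I read through the entire torch forward code and found nothing suspicious. My impression is rather that onnxruntime is causing unsafe GPU memory access. Could it be contiguous versus non-contiguous GPU memory produced by the transpose and view operations? I also tried adding contiguous calls and re-exporting the model, and the error then occurred less often, although that may also be due to a different test environment. I am quite puzzled.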
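
One way to narrow this down is to record which inputs fail across repeated passes: an input that fails on every pass points to a shape baked into the graph, while failures that move around point to the runtime or memory. The sketch below is only an illustration; `classify_failures`, `run_inference` and `fake_inference` are hypothetical names, and in practice `run_inference` would wrap the real GPU inference call.

```python
from collections import Counter


def classify_failures(run_inference, inputs, passes=3):
    """Run every input several times and count failures per input index."""
    failures = Counter()
    for _ in range(passes):
        for idx, item in enumerate(inputs):
            try:
                run_inference(item)
            except Exception:
                failures[idx] += 1
    report = {}
    for idx, count in failures.items():
        # Fails on every pass -> tied to the input; otherwise intermittent.
        report[idx] = "always" if count == passes else "intermittent"
    return report


def fake_inference(n_frames):
    # Hypothetical stand-in: pretend the graph only handles up to 100 frames.
    if n_frames > 100:
        raise RuntimeError("dimension mismatch")


print(classify_failures(fake_inference, [80, 127, 95]))  # {1: 'always'}
```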

LauraGPT · 2024-06-27T09:33:00Z

"Will the release be a C++ GPU deployment solution?"

Yes, we are writing the documentation now.

zhu-gu-an added the question Further information is requested label Jun 17, 2024

LauraGPT assigned lyblsgo Jun 17, 2024

zhu-gu-an closed this as completed Jun 19, 2024

zhu-gu-an reopened this Jun 19, 2024

LauraGPT mentioned this issue Jun 27, 2024

Is there a Flutter demo? #1840

Open
