Hello, author. I found your paper very inspiring and well written, and I have two questions:

1. I have successfully reproduced the code. Using the ViT-L/14 pretrained model on two 4090 GPUs, I got top-1: 95.3% / top-5: 99.2%, which may still fall short of your reported results.

2. When fusing the visual and text features, you use CLIP's default cosine-similarity computation, but I don't quite follow the idea behind this code; it seems to differ from the pseudocode in the original CLIP paper. Could you explain what `logit_scale` is, what it is used for, and why it is initialized this way?

```python
self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))
logit_scale = self.logit_scale.exp()
logits = logit_scale * image_emb @ text_emb.t()
```
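For reference, a minimal runnable sketch of this temperature-scaled cosine similarity, consistent with the CLIP paper's pseudocode (`logits = np.dot(I_e, T_e.T) * np.exp(t)`, where `t` is a learned temperature parameter). The embeddings below are hypothetical random tensors used purely for illustration:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learnable inverse temperature, stored in log space so that exp() keeps it
# positive throughout training. exp(log(1/0.07)) = 1/0.07 ≈ 14.29, i.e. the
# temperature τ = 0.07 used to initialize CLIP.
logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

# Hypothetical batch of 4 paired image/text embeddings, 512-d each.
image_emb = F.normalize(torch.randn(4, 512), dim=-1)  # unit-norm rows
text_emb = F.normalize(torch.randn(4, 512), dim=-1)

# With both sets L2-normalized, image_emb @ text_emb.t() is exactly the
# pairwise cosine similarity, confined to [-1, 1]. Multiplying by
# exp(logit_scale) sharpens the softmax over these similarities; without
# it the cross-entropy gradients would be very flat.
logits = logit_scale.exp() * image_emb @ text_emb.t()  # shape (4, 4)

# Symmetric contrastive loss over matched image/text pairs (diagonal).
labels = torch.arange(4)
loss = (F.cross_entropy(logits, labels)
        + F.cross_entropy(logits.t(), labels)) / 2
```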
Thank you for your interest in our work.
I reproduced the results on the UCF101 dataset.