pranavgupta2603/CLIP-ViL-GradCAM

Implementing Grad-CAM in CLIP-ViL for VQA

The generated heatmaps act as attention maps: they highlight the image regions that the CLIP-ViL model relies on when answering a given question.

For the question "What is next to the bottle?", this is the generated image with the heatmap overlay: Heatmap
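As a rough illustration of how a Grad-CAM heatmap is computed, here is a minimal NumPy sketch of the standard Grad-CAM combination step. It is not the code from this repository: the function name `grad_cam` and the toy arrays are hypothetical. It assumes you have already captured the feature maps of one visual-encoder layer and the gradients of the chosen answer's score with respect to those feature maps (for example, through framework hooks).

```python
import numpy as np

def grad_cam(activations, gradients):
    """Combine feature maps and their gradients into a Grad-CAM heatmap.

    activations, gradients: arrays of shape (K, H, W), i.e. K channels
    on an H x W spatial grid (hypothetical inputs for this sketch).
    """
    # Channel weights: average each channel's gradient over the spatial grid.
    weights = gradients.mean(axis=(1, 2))                      # shape (K,)
    # Weighted sum of the feature maps, followed by ReLU so that only
    # regions with a positive influence on the score remain.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)  # (H, W)
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy data standing in for real activations and gradients.
rng = np.random.default_rng(0)
acts = rng.random((4, 7, 7))
grads = rng.standard_normal((4, 7, 7))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)
```

The resulting low-resolution map is typically resized to the input image size and blended over the image to produce an overlay like the one above.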

Future Scope -

  • Given a video, select the most important frame using the heatmaps generated for each frame
  • Add support for a benchmark VQA dataset such as CLEVR

Related Links

References

@article{shen2021much,
  title={How Much Can CLIP Benefit Vision-and-Language Tasks?},
  author={Shen, Sheng and Li, Liunian Harold and Tan, Hao and Bansal, Mohit and Rohrbach, Anna and Chang, Kai-Wei and Yao, Zhewei and Keutzer, Kurt},
  journal={arXiv preprint arXiv:2107.06383},
  year={2021}
}
